Hello and welcome to this R tutorial! Here we are at the final round of regression, with our final regression model: random forest regression.
In a previous section we saw the decision tree regression model.
So now that decision tree regression doesn't have any secrets for you, you will perfectly understand random forest regression, because a random forest is just a team of decision trees, each one making some prediction of your dependent variable, and the ultimate prediction of the random forest itself is simply the average of the different predictions of all the different trees in the forest.
And actually, at the end of the previous section about decision trees, I asked you an enigma.
The enigma was: knowing the result we got with one tree, what would be the result with ten trees, or 100 trees, or 500 trees, in terms of visualization and in terms of prediction? So I hope that after watching the intuition tutorial made by Kirill you actually asked yourself this question and tried to predict what's going to happen here with random forest regression.
So let’s find out about that.
We are going to build a random forest regression model and see what happens.
So let’s do it.
We are going to start by selecting the right folder as the working directory. So it's in Part 2 - Regression, and here is the final regression model we are building: Random Forest Regression. So let's go inside, and that's the folder we want to set as the working directory, the one containing the Position_Salaries.csv file. So let's click on this More button and Set As Working Directory.
All good.
And now let's take our regression template to build this model efficiently. So we are actually going to take everything from here to the bottom, but we will only include the code section that visualizes the regression model results at a higher resolution, because you understood that the decision tree regression model is a non-continuous regression model, and since a random forest is a combination of decision trees, it's a combination of non-continuous regression models, so intuitively we can guess that the random forest regression is not going to be continuous either. So since the basic visualization code doesn't work for non-continuous regression models, we will use the higher-resolution one, which works perfectly for them. So I'm going to copy this, paste it here, and remove the section that is not appropriate for non-continuous regression models. Here we go. And now the template is ready.
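For reference, here is roughly what the template looks like at this point. This is just a sketch based on the regression template from the previous sections of this part, so the exact comments and plot title in your own file may differ slightly:

# Regression Template

# Importing the dataset
dataset = read.csv('Position_Salaries.csv')
dataset = dataset[2:3]   # keep only the Level and Salary columns

# Fitting the Regression Model to the dataset
# (the regressor is created in the next step)

# Predicting a new result
y_pred = predict(regressor, data.frame(Level = 6.5))

# Visualising the Regression Model results (higher resolution)
library(ggplot2)
x_grid = seq(min(dataset$Level), max(dataset$Level), 0.1)
ggplot() +
  geom_point(aes(x = dataset$Level, y = dataset$Salary),
             colour = 'red') +
  geom_line(aes(x = x_grid,
                y = predict(regressor, newdata = data.frame(Level = x_grid))),
            colour = 'blue') +
  ggtitle('Truth or Bluff (Regression Model)') +
  xlab('Level') +
  ylab('Salary')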
Let’s change the basics.
Let's replace 'Regression Model' here by 'Random Forest Regression': 'Visualising the Random Forest Regression results' and 'Fitting Random Forest Regression to the dataset'. OK, great.
So now let’s build the model which is in this section here.
So let’s remove this.
And as usual we're going to import the right library for the job and then use a function to build our random forest regressor. So the package we are going to import is called randomForest.
So for those of you who don't have the package installed, you can check it out in your Packages tab here.
Mine is already installed because I used it before, but I'm going to write this line here for those of you who need to install it: so install.packages, parentheses, and in quotes randomForest, so no capital R but then a capital F-o-r-e-s-t. All right: randomForest. And since mine is already installed, I'm not going to install it, so I'm going to put this line in comments. But if you want to install it, you just need to select this line as I just did and press Ctrl + Enter (Command + Enter on a Mac) to execute it, and this will install the package properly. But here I'm going to put it in comments by pressing Command + Shift + C. Here we go.
And now what we have to do is to add this line, library(randomForest), which automatically checks the box here, that is, it automatically imports the randomForest package when we execute the whole code or this section.
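So, concretely, these two lines look like this (keep the install line commented out once the package is installed):

# install.packages('randomForest')   # run once if the package is not installed yet
library(randomForest)                # load the package for this session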
So that's important, and now it's time to build the regressor, so let's do it. We're going to call the regressor 'regressor', as usual, to keep things simple, then equals. And now the function that we're going to use is also called randomForest, written the same way. So let's add some parentheses, and now press F1 to have a look at the arguments.
The arguments are here, and the first argument is data, but as you can see it is specified as an optional data frame. We could use this argument to build our regressor, but we are going to use the main arguments instead, to specify the independent variables on one side and the dependent variable on the other side. And to do this we are going to use these two arguments, x and y. So x will contain the matrix of features, that is the independent variables, and y will contain the dependent variable vector, that is the Salary column.
So let's first input these two arguments. The first argument is x equals, and we have several ways to take our independent variables. One of the ways is to take our dataset here and then choose the right columns of the independent variables. You know our dataset is composed of two columns: the first column, indexed by 1, which is the independent variable column, and the second column, indexed by 2, which is our dependent variable column. So here we need index 1 because we want to take the independent variable. Now the next argument is y, the dependent variable vector.
And as you can see, y is expected to be a response vector; it's actually a vector, whereas x here is allowed to be a data frame. So by using this index 1 in brackets here, I actually input a data frame; to get a vector, I need to use another technique, which is to use the dollar sign and then the name of the column, which is of course Salary, and that will give me a vector. So just to recap: the bracket syntax here, dataset[1], gives me a data frame, because we're taking a sub data frame of our original dataset data frame, while the dollar sign syntax, dataset$Salary, takes the Salary column of our dataset as a vector, and that's exactly what we want because the y argument here is expecting a vector. So we're all good.
And now we actually need to input a third argument. Can you guess what it is? For those of you who followed the Python tutorial, well, you will guess what it's going to be: it's ntree, the number of trees in the forest.
Well of course we're building a random forest, so it's a lot better if we can choose the number of trees we build in our forest, and it's even better considering the fact that we're going to play around with different numbers of trees. That is, we're going to start with a forest of 10 trees, and then, you know, we'll try with a lot more than 10 trees, like 100 trees or 300 trees or 500 trees. So that's the third argument we're going to input, ntree, and we're going to start with 10 trees. All right, so let's start with this, and that's all the arguments we need to build a random forest.
We only need the independent variables, the dependent variable and the number of trees, and that will already make a robust random forest regression model; then we will make it even more robust by adding more trees to the forest.
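Putting it together, the fitting section now looks like this:

# Fitting Random Forest Regression to the dataset
regressor = randomForest(x = dataset[1],      # sub data frame: the Level column
                         y = dataset$Salary,  # dependent variable as a vector
                         ntree = 10)          # start with a forest of 10 trees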
But before we continue, let's set the random factors to something fixed so that we all get the same results. You know, in Python we used the random_state parameter, set equal to zero; here in R we can do the same by using the set.seed function. And in this function we put a seed, and we can use whatever we want: in Python we usually take 0 or 42, and in R we like to take either 123 or 1234. So let's use the same seed to get the same results; that's what will make this tutorial easier to follow if you're coding at the same time.
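So just above the randomForest call, we add this line (1234 here, but any fixed value works as long as we all use the same one):

set.seed(1234)  # fix the random number generator so we all get the same forest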
So now we're all good; we're actually all good with the whole code, we don't have anything else to replace. The only thing we will do now is, you know, try several random forests with several numbers of trees, look at the visualization results, and look at the predictions to see if we're getting close to the supposed 160K per year salary of the new employee that is about to be hired.
So let’s do it.
Let’s execute the sections one by one.
So let's import the dataset first. Here we go. Dataset well imported: we make sure we have our two columns, the independent variable Level and the dependent variable Salary. Perfect. Now, no need to split the dataset into a training set and a test set, and no need to apply feature scaling; now it's time to create our first random forest. So let's do this: let's execute this code section here, and here it is, random forest well created. Perfect.
So now it's time to have fun. Would you like to visualize the results first, or get the prediction? Well, first let's maybe visualize the results, because we want to make sure we have the right model, and we want to validate it since we will try several numbers of trees. Here we are starting with ten trees, so we want to see if it looks like a correct model. So I'm going to execute this section. Here we go, and let's see what we get.
OK, so first of all, this looks fine; we don't seem to have any problem here. The only thing we can improve very quickly is, you know, those straight lines here: they are supposed to be vertical, and to get a better representation we just need to increase the resolution, as we did for decision tree regression. So let's set the grid increment to 0.01, that will be sufficient, and let's re-execute this. And now, much better: it almost looks like we have vertical straight lines, representing this non-continuous model much better.
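That resolution change is just the third argument of the seq call that builds the grid in the visualization section:

# finer grid: one point every 0.01 level instead of every 0.1
x_grid = seq(min(dataset$Level), max(dataset$Level), 0.01)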
And so now, what can we say? Let's zoom in on this plot to have a better look at it. Now that's interesting.
OK.
So the answer to the enigma that I asked you in the previous section, and that I was asking you again in this tutorial, is that we simply get more steps in the stairs by having several decision trees instead of one decision tree. We have a lot more steps in the stairs than what we had with one decision tree, and therefore a lot more splits of the whole range of levels, and therefore a lot more intervals of the different levels. So each horizontal straight line here, separated by these vertical lines, is one interval, that is one split. And the fact that we get more steps in the stairs is actually quite intuitive because, you know, if we take for example this prediction here for the 6.5 level, well, what happened for this prediction is that we had 10 trees voting on which salary the 6.5 level position would get, and then the random forest took the average of all the different predictions of the salary of the 6.5 level made by all the different trees in the forest.
And for example, if we take the level-4 position, 10 votes were made: each of these 10 votes corresponds to one prediction of the level-4 salary made by each one of those ten trees, and then the random forest took the average of these 10 predictions, and this average is nothing else than the prediction of the level-4 salary made by the random forest itself. And so we get more steps simply because the whole range of levels is split into more intervals, and that is because the random forest is calculating many different averages of its decision trees' predictions in each of these intervals. So that's what happened; it's quite intuitive.
However, there is something important to point out here: if we add a lot more trees to our random forest, well, it doesn't mean we'll get a lot more steps in the stairs, because the more trees you add, the more the average of the different predictions made by the trees converges to the same average. You know, this is based on the same technique, entropy and information gain. So the more trees you add, the more the average of these votes will converge to the same ultimate average, and therefore it will converge to a certain shape of stairs here. So that's important to visualize as well.
And now, since we have our intuition of the visualization of the random forest regression model, let's see what happens with the prediction. So let's see what prediction we get. Remember that this employee said that his previous salary was 160K. And now let's see what the random forest composed of 10 trees says.
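The prediction line is the one from the template, asking the forest for the salary of the 6.5 level:

y_pred = predict(regressor, data.frame(Level = 6.5))
y_pred   # display the predicted salary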
So let's look at that, and it says that his previous salary was 141 thousand dollars. That's actually a very dangerous prediction, because we are way below the 160K salary that this new hire is said to have had at his previous company. So if we trusted this prediction, we would actually think this employee is bluffing. But no worries, we will not stop here.
Right now we're going to try random forests with a lot more than 10 trees. So let's pick for example 100 trees and let's see what we get. So I'm going to rebuild the model. Here we go.
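Rebuilding just means re-running the fitting section with a new ntree value, keeping the seed so the run stays reproducible:

set.seed(1234)
regressor = randomForest(x = dataset[1],
                         y = dataset$Salary,
                         ntree = 100)   # 10 times more trees than before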
And now let's look at the graphic results. And as I was telling you, we don't get many more steps in this plot of our new random forest regression. You know, we multiplied our number of trees by 10, but the number of steps was definitely not multiplied by 10. We can compare that very quickly: this is the previous plot and this is the new one, ten trees versus one hundred trees. We can see that we have maybe a few more steps, but definitely not ten times the previous number of steps.
So the reason for this, the explanation, is related to the convergence idea that I talked to you about. What changes here with 100 trees, in terms of the plot, is not the number of steps, which barely increased, but a better choice, a better location, of the steps of the stairs with respect to our salary axis. That means the steps are better located to make our ultimate predictions of the salaries for each of our levels from 1 to 10, incremented by 0.01 on our grid.
So to check that out, we simply need to make our final prediction: predicting the salary of the 6.5 level. Let's recap: the employee claims around 160K, and the random forest with ten trees said 141K. Now let's see what the random forest with 100 trees says. Executed: now it says 166K. So, much better. We're getting close to the supposed real salary of 160K, and besides, we're now actually on the good side of the negotiation, because we will no longer think that this employee is bluffing.
So since the prediction seems to be improving as we increase the number of trees, let's actually try with 500 trees; that's a huge forest we have now. So let's execute this to build our new huge forest of 500 trees. Here we go.
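Same change again; only the ntree argument moves:

set.seed(1234)
regressor = randomForest(x = dataset[1],
                         y = dataset$Salary,
                         ntree = 500)   # our huge forest of 500 trees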
Huge forest created. Let's have a quick look at the visualization plot results. But it's going to be the same thing: we will not get a lot more stairs, maybe a few more. Well, actually, let's check it out. Well, definitely not: we seem to have the same number of steps in the stairs. But as I was telling you, each of the steps in these stairs might actually be better located to make each ultimate prediction of the salaries for each of the 10 levels here.
So the best way to check that out is to get our ultimate prediction of the salary of this 6.5 level. And let's check it out: let's see if we get a better prediction than the 166K one. Executing. And right on the spot, we hit the bull's eye, with a predicted salary of 160,458.
So, awesome job by this random forest with 500 trees, because it predicted almost the same salary as the supposed 160K salary this employee is said to have had at his previous company. And actually, before we made this random forest with 500 trees, the best model so far, the one that made the closest prediction to this 160K salary, was the polynomial regression model; now the random forest regression is beating the polynomial regression model, because we get a prediction that is almost the same as the real value. So, right on the spot.
Congratulations.
We actually made our final model, and now I just want to conclude this tutorial by making the transition to one of our future parts, Part 10, where we will build some ensemble machine learning models. These are models that are a combination of several machine learning models, and you know, in machine learning these are actually among the best models. When you have a team of several machine learning models, they can make an awesome prediction, because unless one machine learning model in the team happens to be the only model to be right, well, you are more likely to get the correct prediction with ten machine learning models predicting the same thing than with just one model.
So that's actually what we did here: we had a team of identical machine learning models, which were decision tree regression models. But in the future we'll make a team of different machine learning models. So that's going to be very fun, and very powerful as well.
And I look forward to getting there with you.
So now I'm telling you congratulations for two things: first, for building this very powerful regression model, the random forest regression model; and second, for having built all our regression models. We built some linear regression models, some non-linear regression models, some non-linear non-continuous regression models, and some non-linear, non-continuous, ensemble regression models. So congratulations: you're definitely on your way to becoming an expert in machine learning.
But wait for what's coming next! So, speaking of what's coming next, I look forward to seeing you in the next sections or next parts. And until then, enjoy machine learning!