Random Forests
Bagging

You might be wondering how the trees in the random forest get created. After all, right now, our algorithm for creating a decision tree is deterministic — given a training set, the same tree will be made every time.

Random forests create different trees using a process known as bagging. Every time a decision tree is made, it is built from a different random sample of the points in the training set. For example, if our training set had 1000 rows in it, we could make a decision tree by picking 100 of those rows at random to build the tree. This way, the trees will generally differ from one another, but every tree is still built from a portion of the training data.

One thing to note is that when we’re randomly selecting these 100 rows, we’re doing so with replacement. Picture putting all 1000 rows in a bag and reaching in and grabbing one row at random. After writing down which row we picked, we put that row back in the bag before drawing again.

This means that when we’re picking our 100 random rows, we could pick the same row more than once. In fact, although it’s extremely unlikely, all 100 randomly picked rows could be the same row!
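To put rough numbers on "more than once" and "extremely unlikely", here is a small worked calculation. It assumes 100 independent draws, each equally likely to land on any of the 1000 rows; the variable names are only illustrative.

```python
n_rows = 1000   # rows in the bag
n_draws = 100   # rows we pick, with replacement

# P(all draws distinct) = (1000/1000) * (999/1000) * ... * (901/1000)
p_all_distinct = 1.0
for k in range(n_draws):
    p_all_distinct *= (n_rows - k) / n_rows

# P(at least one row is picked more than once): about 0.99
p_some_repeat = 1 - p_all_distinct

# P(all draws are the same row): the first draw is free,
# and each of the other 99 draws must match it.
p_all_same = (1 / n_rows) ** (n_draws - 1)

print(p_some_repeat)
print(p_all_same)
```

So repeats are close to guaranteed even with only 100 draws, while drawing the same row every time is possible in principle but never seen in practice.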

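Because we’re picking these rows with replacement, there’s no need to shrink our bagged training set from 1000 rows to 100. We can pick 1000 rows at random, and because we can get the same row more than once, we’ll almost certainly end up with a data set that differs from the original: some rows appear several times and others not at all. On average, only about 63% of the original rows show up in a bagged sample of the same size.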
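A minimal sketch of drawing a full-size bagged sample, assuming the rows are addressed by list index (0 through 999). The helper name bootstrap_indices is hypothetical, and the seed is there only so that repeated runs match.

```python
import random

def bootstrap_indices(n_rows, rng):
    """Draw n_rows indices from 0..n_rows-1, with replacement."""
    # randint includes both endpoints, so the upper bound is n_rows - 1.
    return [rng.randint(0, n_rows - 1) for _ in range(n_rows)]

rng = random.Random(0)  # seeded only to make the run repeatable
indices = bootstrap_indices(1000, rng)

unique_count = len(set(indices))
print(len(indices))   # total number of draws
print(unique_count)   # distinct rows drawn; fewer than 1000 because some repeat
```

Comparing the two printed numbers shows how many original rows were left out of this particular sample.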

Let’s implement bagging! We’ll be using the data set of cars that we used in our decision tree lesson.

Instructions

1.

Start by creating a tree using all of the data we’ve given you. Create a variable named tree and set it equal to the result of calling build_tree() with car_data and car_labels as arguments.

Then call print_tree() with tree as the argument. Scroll up to the top to see the root of the tree. Which feature is used to split the data at the root?

2.

For now, comment out printing the tree.

Let’s now implement bagging. The original dataset has 1000 items in it. We want to randomly select a subset of those with replacement.

Create a list named indices that contains 1000 random integers between 0 and 999 (inclusive), which are the valid indices for a 1000-item list. We’ll use this list to remember the 1000 cars and the 1000 labels that we’re going to build a tree with.

You can use either a for loop or a list comprehension to make this list. To get a random integer between 0 and 999, use random.randint(0, 999). Both endpoints are included.

3.

Create two new lists named data_subset and labels_subset. These two lists should contain the cars and labels found at each index in indices.

Once again, you can use either a for loop or a list comprehension to make these lists.

4.

Create a tree named subset_tree by calling build_tree() with data_subset and labels_subset as arguments.

Print subset_tree using the print_tree() function.

Which feature is used to split the data at the root? Is it a different feature than the feature that split the tree that was created using all of the data?

You’ve just created a new tree from the training set! If you drew a different set of 1000 random indices, you’d very likely get yet another tree. Repeating this process many times gives you the collection of different trees that makes up a random forest!
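The whole process can be sketched as a loop that draws a fresh bagged sample for each tree. This is only an outline under stated assumptions: bagged_sample and make_forest are hypothetical helper names, and because the lesson’s build_tree() is not available outside the exercise, the sketch takes the tree builder as a parameter and runs with a placeholder that simply returns the most common label.

```python
import random
from collections import Counter

def bagged_sample(data, labels, rng):
    """Pick len(data) rows with replacement, keeping rows and labels paired."""
    n = len(data)
    indices = [rng.randint(0, n - 1) for _ in range(n)]
    data_subset = [data[i] for i in indices]
    labels_subset = [labels[i] for i in indices]
    return data_subset, labels_subset

def make_forest(data, labels, n_trees, build_tree, rng):
    """Build n_trees trees, each from its own bagged sample."""
    forest = []
    for _ in range(n_trees):
        data_subset, labels_subset = bagged_sample(data, labels, rng)
        forest.append(build_tree(data_subset, labels_subset))
    return forest

# Placeholder builder (NOT a real decision tree): returns the most
# common label in the sample, just so the sketch runs on its own.
def majority_label_stub(data, labels):
    return Counter(labels).most_common(1)[0][0]

# Tiny made-up data set standing in for car_data and car_labels.
data = [[i] for i in range(10)]
labels = ["acc"] * 5 + ["unacc"] * 5

forest = make_forest(data, labels, 5, majority_label_stub, random.Random(42))
print(forest)
```

In the exercise itself, you would pass build_tree in place of the placeholder and car_data and car_labels in place of the toy lists.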
