Binary categorical variables are variables with exactly two possible values. In a regression model, these two values are generally coded as 1 or 0. For example, a multiple regression equation from the survey dataset might look like this:

$$
\text{score} = 32.7 + 8.5*\text{hours\_studied} + 22.5*\text{breakfast}
$$

breakfast is a binary categorical predictor with two possible values: “ate breakfast,” which is coded as 1 in the model, and “didn’t eat breakfast,” which is coded as 0. Substituting each of these values for breakfast in the regression equation gives two equations, one for each group.

For breakfast eaters, we substitute 1 for breakfast and simplify:

$$
\begin{aligned}
\text{score} &= 32.7 + 8.5*\text{hours\_studied} + 22.5*\bm{1} \\
\text{score} &= 32.7 + 8.5*\text{hours\_studied} + 22.5 \\
\text{score} &= (32.7 + 22.5) + 8.5*\text{hours\_studied} \\
\text{score} &= 55.2 + 8.5*\text{hours\_studied}
\end{aligned}
$$

For the group that didn’t eat breakfast, we substitute 0 for breakfast and simplify:

$$
\begin{aligned}
\text{score} &= 32.7 + 8.5*\text{hours\_studied} + 22.5*\bm{0} \\
\text{score} &= 32.7 + 8.5*\text{hours\_studied} + 0 \\
\text{score} &= 32.7 + 8.5*\text{hours\_studied}
\end{aligned}
$$

If we inspect these two equations, we see that the only difference is the larger intercept for the group that ate breakfast (55.2) compared to the group that didn’t eat breakfast (32.7). The coefficient on hours_studied is the same for both groups.
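As a quick check of the substitution, the example equation can be written as a small function and evaluated for both groups. This is only a sketch: the function name predict_score is made up for illustration, and the coefficients are the ones from the example equation rather than from a fitted model.

```python
def predict_score(hours_studied, breakfast):
    """Predicted score from the example equation.

    breakfast must be 1 (ate breakfast) or 0 (didn't eat breakfast).
    """
    return 32.7 + 8.5 * hours_studied + 22.5 * breakfast


# Compare the two groups at several values of hours_studied.
for hours in [0, 1, 2, 3]:
    skipped = predict_score(hours, 0)
    ate = predict_score(hours, 1)
    # The gap between the groups equals the breakfast coefficient every time.
    print(hours, round(skipped, 1), round(ate, 1), round(ate - skipped, 1))
```

At hours_studied = 0 the two predictions are the two intercepts, and the gap between the groups stays at 22.5 for every other value, which is what "same slope, different intercept" means.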

We can visualize this regression equation by adding both lines to the scatter plot of score and hours_studied with plt.plot() as follows:

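```python
import seaborn as sns
import matplotlib.pyplot as plt

# survey is the lesson's DataFrame and is assumed to be loaded already.
# Scatter plot of score vs. hours_studied, colored by breakfast group.
sns.lmplot(x='hours_studied', y='score', hue='breakfast',
           markers=['o', 'x'], fit_reg=False, data=survey)

# Line for the group that didn't eat breakfast (breakfast = 0).
plt.plot(survey.hours_studied, 32.7 + 8.5 * survey.hours_studied,
         color='blue', linewidth=5)
# Line for the group that ate breakfast (breakfast = 1).
plt.plot(survey.hours_studied, 55.2 + 8.5 * survey.hours_studied,
         color='orange', linewidth=5)
plt.show()
```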

Plot showing hours studied on the x-axis and score on the y-axis. Two parallel regression lines run in a positive direction over the scatter plot: the line for the group that didn't eat breakfast starts at a lower intercept than the line for the group that did eat breakfast.

From the plot, we can see the regression lines have the same slope. The orange line for the breakfast-eaters starts higher, but increases at the same rate as the blue line for the group that didn’t eat breakfast.
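To see how a fitted model produces these two parallel lines, here is a minimal sketch that fits the same kind of model with NumPy's least-squares solver. The data are synthetic and noise-free, generated from the example equation, so they are not the survey dataset; the variable names are chosen for illustration only.

```python
import numpy as np

# Synthetic, noise-free data built from the example equation
# (hypothetical values, not the actual survey data).
hours_studied = np.array([1.0, 2.0, 3.0, 4.0, 1.5, 2.5, 3.5, 4.5])
breakfast = np.array([0, 0, 0, 0, 1, 1, 1, 1])  # 1 = ate, 0 = didn't
score = 32.7 + 8.5 * hours_studied + 22.5 * breakfast

# Design matrix: intercept column, hours_studied, and the 0/1 predictor.
X = np.column_stack([np.ones_like(hours_studied), hours_studied, breakfast])

# Ordinary least squares fit.
coef, *_ = np.linalg.lstsq(X, score, rcond=None)
intercept, slope, breakfast_coef = coef

# One slope, two intercepts: the 0/1 coefficient shifts the line up or down.
intercept_no_breakfast = intercept
intercept_breakfast = intercept + breakfast_coef

print(round(intercept_no_breakfast, 1),
      round(intercept_breakfast, 1),
      round(slope, 1))
```

Because the model has a single coefficient for hours_studied, both groups share that slope; the coefficient on the binary predictor only moves the intercept.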

Instructions

1. Code has been provided for you in script.py to fit a regression model predicting port3 based on math1 and address. The fitted model has been saved as model1. Use .params to print the intercept and coefficients from the results and inspect the coefficient for address.

2. The variable address has two values: R for rural (coded as address = 0 in the model) and U for urban (coded as address = 1). Because we’ve included this binary variable in our model, we’ve actually fit two separate regression lines: one for students who live at a rural address, and one for students who live at an urban address.

Using the output from the model, write out the regression equation for when address is equal to R and save the value of the intercept as interceptR. Then, write out the regression equation for when address is equal to U and save the value of the intercept as interceptU. Finally, since the slope on math1 will be the same for both equations, save this value as slope. Round all final values to one decimal place (i.e., the tenths place).

3. The code for the scatter plot of port3 and math1 has been provided for you in script.py. Using the regression equations you created in the last checkpoint, add a blue line to the scatter plot for rural addresses.

4. Using the regression equations with rounded values that you created in the second step, add an orange line to the scatter plot for urban addresses. What’s similar about the two lines you just plotted? What’s different?
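The arithmetic behind the second checkpoint can be sketched as follows. Everything here is an assumption made for illustration: the numbers are placeholders rather than real output from model1, and the coefficient labels assume the model was fit with a statsmodels formula such as 'port3 ~ math1 + address', in which case the urban indicator is typically labeled address[T.U]. Check the labels that model1.params actually prints.

```python
# Placeholder coefficients standing in for model1.params.
# These values are hypothetical, NOT the real model output.
params = {
    "Intercept": 10.04,     # intercept when address = 0 (rural)
    "address[T.U]": 0.67,   # shift in intercept when address = 1 (urban)
    "math1": 0.19,          # slope shared by both groups
}

# Rural line: address = 0, so the address term drops out.
interceptR = round(params["Intercept"], 1)

# Urban line: address = 1, so the address coefficient adds to the intercept.
interceptU = round(params["Intercept"] + params["address[T.U]"], 1)

# The slope on math1 is the same for both lines.
slope = round(params["math1"], 1)

print(interceptR, interceptU, slope)
```

With the real values in hand, each line can be drawn the same way as in the breakfast example: pass the math1 column and intercept + slope * math1 to plt.plot().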
