Binary categorical variables are variables with exactly two possible values. In a regression model, these two values are generally coded as 1 or 0. For example, a multiple regression equation from the `survey` dataset might look like this:

`$\text{score} = 32.7 + 8.5*\text{hours\_studied} + 22.5* \text{breakfast}$`

`breakfast` is a binary categorical predictor with two possible values: “ate breakfast,” which is coded as `1` in the model, and “didn’t eat breakfast,” which is coded as `0`. If we substitute these values for `breakfast` in the regression equation, we end up with two equations: one for each group.

For breakfast eaters, we substitute 1 for `breakfast` and simplify:

```
$\begin{aligned}
\text{score} = 32.7 + 8.5*\text{hours\_studied} + 22.5*\bm{1}& \\
\text{score} = 32.7 + 8.5*\text{hours\_studied} + 22.5& \\
\text{score} = (32.7 + 22.5) + 8.5*\text{hours\_studied}& \\
\text{score} = 55.2 + 8.5*\text{hours\_studied}& \\
\end{aligned}$
```

For the group that didn’t eat breakfast, we substitute 0 for `breakfast` and simplify:

```
$\begin{aligned}
\text{score} = 32.7 + 8.5*\text{hours\_studied} + 22.5*\bm{0}& \\
\text{score} = 32.7 + 8.5*\text{hours\_studied} + 0& \\
\text{score} = 32.7 + 8.5*\text{hours\_studied}& \\
\end{aligned}$
```

If we inspect these two equations, we see that the only difference is the larger intercept for the group that ate breakfast (55.2) compared to the group that didn’t eat breakfast (32.7). The coefficient on `hours_studied` is the same for both groups.
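The substitution can also be checked numerically. The following is a minimal sketch that hard-codes the coefficients from the example equation; the function name `predict_score` and the constant names are hypothetical and not part of the lesson's code.

```python
# Coefficients taken from the example regression equation
INTERCEPT = 32.7
B_HOURS = 8.5
B_BREAKFAST = 22.5

def predict_score(hours_studied, breakfast):
    """Predicted score; breakfast is 1 (ate breakfast) or 0 (didn't)."""
    return INTERCEPT + B_HOURS * hours_studied + B_BREAKFAST * breakfast

# With hours_studied = 0, the prediction equals each group's intercept
intercept_no_breakfast = predict_score(0, 0)
intercept_breakfast = predict_score(0, 1)
print(round(intercept_no_breakfast, 1), round(intercept_breakfast, 1))  # 32.7 55.2
```

Setting `breakfast` to 1 simply adds the 22.5 coefficient to the intercept, while the `hours_studied` term is unaffected.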

We can visualize this regression equation by adding both lines to the scatter plot of `score` and `hours_studied` with `plt.plot()` as follows:

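```
import seaborn as sns
import matplotlib.pyplot as plt

# Scatter plot of score against hours studied, colored by breakfast group
sns.lmplot(x='hours_studied', y='score', hue='breakfast', markers=['o', 'x'], fit_reg=False, data=survey)

# Blue line: didn't eat breakfast (breakfast = 0); orange line: ate breakfast (breakfast = 1)
plt.plot(survey.hours_studied, 32.7 + 8.5*survey.hours_studied, color='blue', linewidth=5)
plt.plot(survey.hours_studied, 55.2 + 8.5*survey.hours_studied, color='orange', linewidth=5)
plt.show()
```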

From the plot, we can see that the regression lines have the same slope. The orange line for the breakfast eaters starts higher but increases at the same rate as the blue line for the group that didn’t eat breakfast.
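One way to confirm that the lines are parallel is to check that the vertical gap between them does not change with `hours_studied`. This is a small sketch using the two group equations derived earlier; the helper names `line_no_breakfast` and `line_breakfast` are made up for illustration.

```python
def line_no_breakfast(hours):
    # Group equation for breakfast = 0
    return 32.7 + 8.5 * hours

def line_breakfast(hours):
    # Group equation for breakfast = 1
    return 55.2 + 8.5 * hours

# Vertical distance between the two lines at 0 through 5 hours studied
gaps = [round(line_breakfast(h) - line_no_breakfast(h), 1) for h in range(6)]
print(gaps)
```

Every gap equals the `breakfast` coefficient, 22.5, which is what "same slope, different intercept" means geometrically.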

### Instructions

**1.**

Code has been provided for you in **script.py** to fit a regression model predicting `port3` based on `math1` and `address`. The fitted model has been saved as `model1`. Use `.params` to print the intercept and coefficients from the results and inspect the coefficient for `address`.

**2.**

The variable `address` has two values: `R` for rural (coded as `address = 0` in the model) and `U` for urban (coded as `address = 1`). Because we’ve included this binary variable in our model, we’ve actually fit two separate regression lines: one for students who live at a rural address, and one for students who live at an urban address.

Using the output from the model, write out the regression equation for when `address` is equal to `R` and save the value of the intercept as `interceptR`. Then, write out the regression equation for when `address` is equal to `U` and save the value of the intercept as `interceptU`. Finally, since the slope on `math1` will be the same for both equations, save this value as `slope`. Round all final values to one decimal place (i.e., the tenths place).

**3.**

The code for the scatter plot of `port3` and `math1` has been provided for you in **script.py**. Using the regression equations you created in the last checkpoint, add a blue line to the scatter plot for rural addresses.

**4.**

Using the regression equations with rounded values that you created in the second step, add an orange line to the scatter plot for urban addresses. What’s similar about the two lines you just plotted? What’s different?
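The arithmetic for the second checkpoint can be sketched as follows. This assumes the fitted parameters are available as name-to-value pairs, as `.params` provides. The numbers below are made-up placeholders rather than output from `model1`, and the key `address[T.U]` is an assumption about how the urban indicator is labeled in the printed output.

```python
# Placeholder values for illustration only (NOT real model output);
# replace them with the values printed by model1.params
params = {'Intercept': 10.0, 'address[T.U]': 2.0, 'math1': 0.5}

# Rural (address = 0): the address term drops out of the equation
interceptR = round(params['Intercept'], 1)
# Urban (address = 1): the address coefficient is added to the intercept
interceptU = round(params['Intercept'] + params['address[T.U]'], 1)
# The slope on math1 is shared by both equations
slope = round(params['math1'], 1)

print(interceptR, interceptU, slope)
```

The pattern mirrors the breakfast example: the coefficient on the binary variable shifts the intercept for the group coded as 1, and the slope stays the same.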
