With the data from Codecademy University, we want to predict whether each student will pass their final exam. The first step in making that prediction is to predict the probability of each student passing. Why not use a Linear Regression model for this, you might ask? Let’s give it a try.
Recall that in Linear Regression, we fit a regression line of the following form to the data:
y = b_0 + b_1 x_1 + b_2 x_2 + … + b_n x_n
where
y is the value we are trying to predict
b_0 is the intercept of the regression line
b_1, b_2, …, b_n are the coefficients of the features x_1, x_2, …, x_n of the regression line
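To make the linear form concrete, here is a minimal sketch that computes a prediction as the intercept plus the weighted sum of the features. The function name predict_linear and the numbers passed to it are made up for illustration; they do not come from the lesson's dataset.

```python
def predict_linear(intercept, coefficients, features):
    """Return b_0 + b_1*x_1 + ... + b_n*x_n for one data point."""
    if len(coefficients) != len(features):
        raise ValueError("need exactly one coefficient per feature")
    return intercept + sum(b * x for b, x in zip(coefficients, features))


# Hypothetical values: b_0 = 0.5, (b_1, b_2) = (2.0, -1.0), (x_1, x_2) = (3.0, 4.0)
example_prediction = predict_linear(0.5, [2.0, -1.0], [3.0, 4.0])
print(example_prediction)
```

Nothing in this computation limits the result to a particular range, which matters once y is meant to be a probability.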
For our data points, y is either 1 (passing) or 0 (failing), and we have one feature, num_hours_studied. Below, we fit a Linear Regression model to our data and plot the results, with the line of best fit in red.

A problem quickly arises. For low values of num_hours_studied, the regression line predicts negative probabilities of passing, and for high values of num_hours_studied, it predicts probabilities of passing greater than 1. These probabilities are meaningless! They arise because the output of a Linear Regression model ranges from -∞ to +∞, whereas a probability must lie between 0 and 1.
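The effect can be reproduced on a small scale. The sketch below fits a one-feature least-squares line to synthetic pass/fail labels and evaluates it at 1 and 19 hours. The data are an assumption made up for this example (students studying 1 to 20 hours, with everyone above 10 hours passing), not the Codecademy University dataset, and the helper fit_simple_ols is a hypothetical name using the closed-form solution for a single feature.

```python
# Synthetic stand-in data (an assumption, not the lesson's dataset):
# one student per hour value from 1 to 20; those above 10 hours pass.
hours_studied = list(range(1, 21))
passed = [0 if h <= 10 else 1 for h in hours_studied]


def fit_simple_ols(xs, ys):
    """Closed-form least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    slope = s_xy / s_xx
    intercept = mean_y - slope * mean_x
    return intercept, slope


intercept, slope = fit_simple_ols(hours_studied, passed)

# Read the "probability" of passing off the fitted line.
pred_1_hour = intercept + slope * 1
pred_19_hours = intercept + slope * 19

print("1 hour:", pred_1_hour)
print("19 hours:", pred_19_hours)
```

On this toy data the fitted line gives a value below 0 at 1 hour and a value above 1 at 19 hours, so neither can be read as a probability.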
Instructions
Provided to you is the code to train a linear regression model on the Codecademy University data and plot the regression line. Run the code and observe the plot. Expand the plot to fullscreen for a larger view.
Using the regression line, estimate the probability of passing for a student who studies 1 hour and for a student who studies 19 hours. Save the results to slacker and studious, respectively.
What is wrong with using a Linear Regression model to predict these probabilities?