With the data from Codecademy University, we want to predict whether each student will pass their final exam. The first step toward that prediction is to estimate the probability of each student passing. Why not use a Linear Regression model for the prediction, you might ask? Let’s give it a try.
Recall that in Linear Regression, we fit a regression line of the following form to the data:
y = b_0 + b_1 x_1 + b_2 x_2 + ... + b_n x_n
where
y is the value we are trying to predict
b_0 is the intercept of the regression line
b_1, ..., b_n are the coefficients of the features x_1, ..., x_n
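To make the regression-line form concrete, here is a minimal sketch that evaluates it for one data point. The function name `linear_prediction` and the numeric values are hypothetical, chosen only for illustration; they do not come from the Codecademy University data.

```python
def linear_prediction(intercept, coefficients, features):
    """Evaluate y = b_0 + b_1*x_1 + ... + b_n*x_n for one data point."""
    if len(coefficients) != len(features):
        raise ValueError("need exactly one coefficient per feature")
    return intercept + sum(b * x for b, x in zip(coefficients, features))


# Hypothetical model with a single feature (hours studied).
b_0 = -0.5      # intercept (assumed value)
b_n = [0.1]     # one coefficient, for the one feature (assumed value)
x_n = [8]       # a student who studied 8 hours

y = linear_prediction(b_0, b_n, x_n)
print(f"predicted y: {y:.2f}")
```

With one feature, the sum collapses to a single term, so the prediction is simply the intercept plus the coefficient times the hours studied.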
For our data points, y is either 1 (passing) or 0 (failing), and we have one feature, num_hours_studied. Below, we fit a Linear Regression model to our data and plot the results, with the line of best fit in red.
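As a rough illustration of what fitting a line to 0/1 labels involves, the sketch below computes the least-squares intercept and slope in closed form for a single feature. It is not the lesson's provided code: the helper `fit_line`, the variable `passed_exam`, and the data are all made up for this example.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one feature; returns (intercept, slope)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance of x and y divided by variance of x.
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    slope = s_xy / s_xx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Made-up data: students who studied more than 10 hours passed (1),
# everyone else failed (0). Real data would not be this clean.
num_hours_studied = list(range(1, 21))
passed_exam = [1 if hours > 10 else 0 for hours in num_hours_studied]

b_0, b_1 = fit_line(num_hours_studied, passed_exam)
print(f"intercept b_0 = {b_0:.4f}, slope b_1 = {b_1:.4f}")
```

Note that the labels are only ever 0 or 1, but nothing in the fitting procedure forces the fitted line to stay within that range.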
A problem quickly arises. For low values of num_hours_studied, the regression line predicts negative probabilities of passing, and for high values of num_hours_studied, it predicts probabilities of passing greater than 1. These probabilities are meaningless! We get them because the output of a Linear Regression model is unbounded, ranging from -∞ to +∞, while a probability must lie between 0 and 1.
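The out-of-range behavior can be checked with a little arithmetic. For a line with positive slope, the prediction is below 0 whenever hours < -b_0 / b_1 and above 1 whenever hours > (1 - b_0) / b_1. The sketch below assumes a hypothetical intercept and slope (not values fitted to the lesson's data), and the function name `valid_probability_range` is likewise an assumption.

```python
def valid_probability_range(intercept, slope):
    """Return the (low, high) inputs between which a line with
    positive slope stays inside the interval [0, 1]."""
    if slope <= 0:
        raise ValueError("this sketch assumes a positive slope")
    low = -intercept / slope          # where the line crosses 0
    high = (1 - intercept) / slope    # where the line crosses 1
    return low, high


# Hypothetical regression line: prediction = b_0 + b_1 * hours
b_0, b_1 = -0.25, 0.075

low, high = valid_probability_range(b_0, b_1)
print(f"predictions lie in [0, 1] only for {low:.2f} <= hours <= {high:.2f}")

for hours in (1, 10, 19):
    prediction = b_0 + b_1 * hours
    label = "valid" if 0 <= prediction <= 1 else "not a valid probability"
    print(f"{hours:>2} hours -> {prediction:.3f} ({label})")
```

Any straight line with a nonzero slope has this property: it must eventually leave the interval [0, 1] in both directions, which is why a bounded function is needed to model probabilities.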
The code to train a linear regression model on the Codecademy University data and plot the regression line is provided. Run the code and observe the plot. Expand the plot to fullscreen for a larger view.
Using the regression line, estimate the probability of passing for a student who studies 1 hour and for a student who studies 19 hours. Save the results to
What is wrong with using a Linear Regression model to predict these probabilities?