Learn

Logistic Regression

Linear Regression Approach

With the data from Codecademy University, we want to predict whether each student will pass their final exam. And the first step to making that prediction is to predict the probability of each student passing. Why not use a **Linear Regression** model for the prediction, you might ask? Let’s give it a try.

Recall that in Linear Regression, we fit a regression line of the following form to the data:

`$y = b_{0} + b_{1}x_{1} + b_{2}x_{2} +\cdots + b_{n}x_{n}$`

where

`y`

is the value we are trying to predict`b_0`

is the intercept of the regression line`b_1`

,`b_2`

, …`b_n`

are the coefficients of the features`x_1`

,`x_2`

, …`x_n`

of the regression line

For our data points `y`

is either `1`

(passing), or `0`

(failing), and we have one feature, `num_hours_studied`

. Below we fit a Linear Regression model to our data and plotted the results, with the line of best fit in red.

A problem quickly arises. For low values of `num_hours_studied`

the regression line predicts negative probabilities of passing, and for high values of `num_hours_studied`

the regression line predicts probabilities of passing greater than `1`

. These probabilities are meaningless! We get these meaningless probabilities since the output of a Linear Regression model ranges from -∞ to +∞.

Provided to you is the code to train a linear regression model on the Codecademy University data and plot the regression line. Run the code and observe the plot. Expand the plot to fullscreen for a larger view.

Using the regression line, estimate the probability of passing for a student who studies `1`

hour and for a student who studies `19`

hours. Save the results to `slacker`

and `studious`

, respectively.

What is wrong with using a Linear Regression model to predict these probabilities?