
Linear regression is a statistical approach for modelling the relationship between a dependent variable and a given set of independent variables.

Here, for simplicity, we refer to the dependent variable as the response and to the independent variables as features.

To provide a basic understanding of linear regression, we start with its most basic version, simple linear regression.

**Simple Linear Regression**

Simple linear regression is an approach for predicting a response using a single feature. It assumes that the two variables are linearly related. Hence, we try to find a linear function that predicts the response value (y) as accurately as possible from the feature, or independent variable (x).

Consider a dataset with a response value y for every feature value x:

x | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
---|---|---|---|---|---|---|---|---|---|---|
y | 1 | 3 | 2 | 5 | 7 | 8 | 8 | 9 | 10 | 12 |

For generality, we define:

x as the feature vector, i.e. $x = [x_1, x_2, \ldots, x_n]$,

y as the response vector, i.e. $y = [y_1, y_2, \ldots, y_n]$

for n observations (in the example above, n = 10).

Plotted as a scatter plot (x on the horizontal axis, y on the vertical axis), the points of the above dataset lie roughly along a rising straight line.

Now, the task is to find the line that best fits this scatter plot, so that we can predict the response for any new feature value (i.e. a value of x not present in the dataset).

This line is called the regression line.

The equation of the regression line is: $Y=a+bX$

Here **X** is the explanatory variable and **Y** is the dependent variable. The slope of the line is **b**, and **a** is the intercept (the value of **Y** when **X = 0**).
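One common way to make "fits best" precise is the least-squares criterion: choose a and b so that the sum of squared vertical distances between each observed $y_i$ and the line is as small as possible. The symbol $S$ below is introduced here only for illustration:

$S(a,b)=\sum_{i=1}^{n}\left(y_i-(a+bx_i)\right)^2$

Setting the partial derivatives of $S$ with respect to a and b to zero gives two linear equations in a and b:

$\sum y_i = na + b\sum x_i$

$\sum x_i y_i = a\sum x_i + b\sum x_i^2$

Solving this pair for a and b yields the usual closed-form expressions in terms of $\sum x$, $\sum y$, $\sum xy$ and $\sum x^2$.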

**The Linear Regression Equation**

Linear regression is a way to model the relationship between two variables. You might also recognize the equation as the **slope-intercept form** of a line. The equation has the form Y=a+bX, where Y is the dependent variable (the variable plotted on the Y axis), X is the independent variable (plotted on the X axis), b is the slope of the line and a is the y-intercept. The coefficients a and b are computed from the data as:

$a=\frac{(\sum y)(\sum x^2)-(\sum x)(\sum xy)}{n(\sum x^2)-(\sum x)^2}$

$b=\frac{n(\sum xy)-(\sum x)(\sum y)}{n(\sum x^2)-(\sum x)^2}$
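As a minimal sketch, the two formulas can be transcribed directly into Python. The function name `fit_line` and the variable names are chosen here for illustration; the data is the ten-point x/y dataset from the simple linear regression example.

```python
def fit_line(xs, ys):
    """Return (a, b) for the line y = a + b*x using the sum formulas."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)

    # Shared denominator of both formulas: n * sum(x^2) - (sum(x))^2
    denom = n * sum_x2 - sum_x ** 2
    if denom == 0:
        raise ValueError("all x values are identical; the slope is undefined")

    a = (sum_y * sum_x2 - sum_x * sum_xy) / denom  # intercept
    b = (n * sum_xy - sum_x * sum_y) / denom       # slope
    return a, b


# Ten-point dataset from the example table (x = 0..9).
xs = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
ys = [1, 3, 2, 5, 7, 8, 8, 9, 10, 12]

a, b = fit_line(xs, ys)
print(f"a = {a:.4f}, b = {b:.4f}")  # a = 1.2364, b = 1.1697
```

The zero-denominator check covers the degenerate case where every x is the same, in which no unique line exists.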

The first step in finding a linear regression equation is to determine if there is a relationship between the two variables. This is often a judgment call for the researcher. You’ll also need a list of your data in x-y format (i.e. two columns of data — independent and dependent variables).

**Points to be Considered:**

Just because two variables are related, it does not mean that one causes the other. For example, although there is a relationship between high GRE scores and better performance in graduate school, it doesn’t mean that high GRE scores cause good graduate school performance.

If you try to find a linear regression equation for a set of data (especially through an automated program like Excel or a TI-83), you will find one, but that does not necessarily mean the equation is a good fit for your data. One technique is to make a scatter plot first, to see whether the data roughly fit a line before you try to find a linear regression equation.

**Step 1:** Make a chart of your data, filling in the columns

Subject | Age (X) | Glucose level (Y) | XY | $X^2$ | $Y^2$ |
---|---|---|---|---|---|
1 | 43 | 99 | 4257 | 1849 | 9801 |
2 | 21 | 65 | 1365 | 441 | 4225 |
3 | 25 | 79 | 1975 | 625 | 6241 |
4 | 42 | 75 | 3150 | 1764 | 5625 |
5 | 57 | 87 | 4959 | 3249 | 7569 |
6 | 59 | 81 | 4779 | 3481 | 6561 |
$\sum$ | 247 | 486 | 20485 | 11409 | 40022 |

From the above table, $\sum x = 247$, $\sum y = 486$, $\sum xy = 20485$, $\sum x^2 = 11409$, $\sum y^2 = 40022$. n is the sample size (6, in our case).
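The table columns and their totals can be reproduced with a few lines of Python, which is a convenient way to check the arithmetic. This is only a sketch; the names `ages`, `glucose`, `rows` and `totals` are illustrative.

```python
# Age (X) and glucose level (Y) for the six subjects in the table.
ages = [43, 21, 25, 42, 57, 59]
glucose = [99, 65, 79, 75, 87, 81]

# One row per subject: (X, Y, XY, X^2, Y^2)
rows = [(x, y, x * y, x * x, y * y) for x, y in zip(ages, glucose)]

# Column-wise totals: (sum X, sum Y, sum XY, sum X^2, sum Y^2)
totals = tuple(sum(column) for column in zip(*rows))

for subject, row in enumerate(rows, start=1):
    print(subject, *row)
print("sum", *totals)  # sum 247 486 20485 11409 40022
```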

**Step 2:** Use the following equations to find a and b.

$a=\frac{(\sum y)(\sum x^2)-(\sum x)(\sum xy)}{n(\sum x^2)-(\sum x)^2}$

$b=\frac{n(\sum xy)-(\sum x)(\sum y)}{n(\sum x^2)-(\sum x)^2}$

a = **65.1416**

b = **0.385225**

**Find a:**

$a=\frac{(486 \times 11{,}409)-(247 \times 20{,}485)}{6(11{,}409)-247^2}$

$=\frac{484{,}979}{7{,}445}$

$=65.14$

**Find b:**

$b=\frac{6(20{,}485)-(247 \times 486)}{6(11{,}409)-247^2}$

$=\frac{122{,}910-120{,}042}{68{,}454-247^2}$

$=\frac{2{,}868}{7{,}445}$

$=0.385225$

**Step 3:** Insert the values into the equation.

$y'=a+bx$

$y'=65.14+0.385225x$
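Steps 2 and 3 can be sketched in Python starting from the table totals. The helper `predict` and the input age of 50 are hypothetical, added here only to show how the fitted equation would be used for a value of x not in the data.

```python
# Totals taken from the Step 1 table.
n = 6
sum_x, sum_y = 247, 486
sum_xy, sum_x2 = 20485, 11409

# Step 2: intercept a and slope b share the same denominator.
denom = n * sum_x2 - sum_x ** 2                 # 68454 - 61009 = 7445
a = (sum_y * sum_x2 - sum_x * sum_xy) / denom   # 484979 / 7445
b = (n * sum_xy - sum_x * sum_y) / denom        # 2868 / 7445

print(f"a = {a:.4f}")  # a = 65.1416
print(f"b = {b:.6f}")  # b = 0.385225


# Step 3: plug a and b into y' = a + b*x.
def predict(x):
    """Predicted glucose level for age x under the fitted line (illustrative helper)."""
    return a + b * x


# Hypothetical input: an age of 50, which is not one of the six subjects.
print(f"{predict(50):.2f}")  # 84.40
```

Keeping a and b at full precision, rather than the rounded 65.14, avoids carrying rounding error into the predictions.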

**Regression Coefficient**

A regression coefficient is the same thing as the **slope of the line of the regression equation**. The equation for the regression coefficient that you’ll find on the AP Statistics test is:

$B_1=b_1=\frac{\sum (x_i-\bar{x})(y_i-\bar{y})}{\sum (x_i-\bar{x})^2}$

Here $\bar{x}$ is the mean of x and $\bar{y}$ is the mean of y.
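This mean-deviation form of the slope gives the same number as the sums-based formula for b in Step 2. The sketch below checks this on the age and glucose data; the variable names are illustrative. It also uses the standard identity that the least-squares line passes through the point of means, so the intercept can be recovered as the mean of y minus the slope times the mean of x.

```python
ages = [43, 21, 25, 42, 57, 59]
glucose = [99, 65, 79, 75, 87, 81]
n = len(ages)

mean_x = sum(ages) / n
mean_y = sum(glucose) / n

# Numerator: sum of products of deviations from the means.
s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(ages, glucose))
# Denominator: sum of squared deviations of x from its mean.
s_xx = sum((x - mean_x) ** 2 for x in ages)

b1 = s_xy / s_xx

# Same slope from the sums-based formula used in Step 2.
sum_xy = sum(x * y for x, y in zip(ages, glucose))
sum_x2 = sum(x * x for x in ages)
b_sums = (n * sum_xy - sum(ages) * sum(glucose)) / (n * sum_x2 - sum(ages) ** 2)

# Intercept recovered from the means: a = mean_y - b1 * mean_x
a = mean_y - b1 * mean_x

print(f"{b1:.6f} {b_sums:.6f}")  # 0.385225 0.385225
print(f"{a:.4f}")                # 65.1416
```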