Definition

Linear regression is a fundamental regression-analysis method used to describe the linear relationship between two variables. The simple linear regression model takes the form:

\[ y = \beta_0 + \beta_1 x \]

Where:

  • \( y \) is the dependent variable (the response variable).
  • \( x \) is the independent variable (the predictor variable).
  • \( \beta_0 \) is the intercept (the predicted value of \( y \) when \( x = 0 \)).
  • \( \beta_1 \) is the slope (the change in \( y \) for each one-unit increase in \( x \)).

The slope \( \beta_1 \) is calculated with the following formula:

\[ \beta_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2} \]

Where:

  • \( n \) is the number of data points.
  • \( x_i \) and \( y_i \) are the values of the independent and dependent variables for the \( i \)-th data point.
  • \( \bar{x} \) is the mean of the independent variable \( x \).
  • \( \bar{y} \) is the mean of the dependent variable \( y \).

Explanation

  • The numerator, \( \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) \), is the (unnormalised) covariance of \( x \) and \( y \), measuring how the two variables vary together.
  • The denominator, \( \sum_{i=1}^{n} (x_i - \bar{x})^2 \), is the (unnormalised) variance of \( x \), measuring how much \( x \) varies.
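
Dividing both sums by the same factor leaves the ratio unchanged, so the slope can equivalently be written as \( \beta_1 = \operatorname{Cov}(x, y) / \operatorname{Var}(x) \). A minimal sketch (assuming NumPy is available) that checks this identity on the example data used later in this section:

import numpy as np

x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 3, 5, 4, 6])

# np.cov returns the 2x2 sample covariance matrix; entry [0, 1] is Cov(x, y).
cov_xy = np.cov(x, y)[0, 1]
# ddof=1 gives the sample variance, matching the normalisation used by np.cov.
var_x = np.var(x, ddof=1)

print(cov_xy / var_x)  # expected: 0.9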

Calculation Steps

  1. Compute the means of \( x \) and \( y \), denoted \( \bar{x} \) and \( \bar{y} \).
  2. Compute each data point's deviation from its mean, \( x_i - \bar{x} \) and \( y_i - \bar{y} \).
  3. Compute the sum of the products of these deviations, \( \sum (x_i - \bar{x})(y_i - \bar{y}) \).
  4. Compute the sum of squared deviations of \( x \), \( \sum (x_i - \bar{x})^2 \).
  5. Divide the sum of products by the sum of squares to obtain the slope \( \beta_1 \).

Example

Suppose we have the following data points:

  • \( (1, 2) \)
  • \( (2, 3) \)
  • \( (3, 5) \)
  • \( (4, 4) \)
  • \( (5, 6) \)

We can use the formula above to compute the slope \( \beta_1 \):

  1. Compute the means:
     \( \bar{x} = (1 + 2 + 3 + 4 + 5) / 5 = 3 \), \( \bar{y} = (2 + 3 + 5 + 4 + 6) / 5 = 4 \)

  2. Compute the deviations:
     \( x_i - \bar{x} = -2, -1, 0, 1, 2 \) and \( y_i - \bar{y} = -2, -1, 1, 0, 2 \)

  3. Compute the sum of the products of the deviations:
     \( (-2)(-2) + (-1)(-1) + (0)(1) + (1)(0) + (2)(2) = 4 + 1 + 0 + 0 + 4 = 9 \)

  4. Compute the sum of squared deviations of \( x \):
     \( (-2)^2 + (-1)^2 + 0^2 + 1^2 + 2^2 = 4 + 1 + 0 + 1 + 4 = 10 \)

  5. Compute the slope:
     \( \beta_1 = 9 / 10 = 0.9 \)

Therefore, the slope \( \beta_1 \) is 0.9.

Python Code Example

We can also verify the calculation with Python code:

import numpy as np
 
# Data points
x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 3, 5, 4, 6])
 
# Compute the means
x_mean = np.mean(x)
y_mean = np.mean(y)
 
# Compute the slope (numerator and denominator from the formula above)
numerator = np.sum((x - x_mean) * (y - y_mean))
denominator = np.sum((x - x_mean) ** 2)
beta_1 = numerator / denominator
 
print(f'Slope (beta_1): {beta_1}')
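
As a quick cross-check (still assuming NumPy, as above), np.polyfit fits a least-squares polynomial; with degree 1 it returns the slope and intercept of the regression line:

slope, intercept = np.polyfit(x, y, 1)
print(slope)      # expected: 0.9
print(intercept)  # expected: 1.3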

Residual

Most commonly, we fit a model by minimising the residual sum of squares. This means that the cost function is calculated like so:

  1. Calculate the difference between the actual and predicted values (as previously) for each data point.
  2. Square these values.
  3. Sum (or average) these squared values.

This squaring step means that not all points contribute evenly to the line: outliers, points that don't fall in the expected pattern, produce disproportionately large squared errors, which can strongly influence the position of the line.
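
A minimal sketch (assuming NumPy is available and using hypothetical actual and predicted values) of how the residual sum of squares is computed, and of how squaring lets a single outlier dominate the cost:

import numpy as np

# Hypothetical actual and predicted values; the last actual value is an outlier.
actual = np.array([2.0, 3.0, 5.0, 4.0, 12.0])
predicted = np.array([1.9, 3.1, 4.8, 4.2, 5.9])

residuals = actual - predicted   # step 1: differences between actual and predicted
squared = residuals ** 2         # step 2: square the differences
rss = np.sum(squared)            # step 3: sum them to get the cost

print(squared)  # the outlier's squared residual dwarfs the others
print(rss)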

Strengths of regression

Regression techniques have many strengths that more complex models don’t.

Predictable and easy to interpret

Regressions are easy to interpret because they describe simple mathematical equations, which we can often graph. More complex models are often referred to as black box solutions, because it’s difficult to understand how they make predictions or how they’ll behave with certain inputs.

Easy to extrapolate

Regressions make it easy to extrapolate, that is, to make predictions for values outside the range of our dataset. For example, it's simple to estimate from our previous example that a nine-year-old dog will have a temperature of 40.5°C. You should always apply caution to extrapolation, though: this model would predict that a 90-year-old dog would have a temperature nearly hot enough to boil water.
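
A minimal sketch of this kind of extrapolation. The intercept and slope below are hypothetical values chosen only to be roughly consistent with the figures quoted above; the actual dog-temperature model is not reproduced in this section:

# Hypothetical fitted line: temperature (°C) = intercept + slope * age (years)
intercept = 34.2  # assumed value, for illustration only
slope = 0.7       # assumed value, for illustration only

def predict_temperature(age_years):
    return intercept + slope * age_years

print(predict_temperature(9))   # 40.5, a plausible temperature
print(predict_temperature(90))  # 97.2, a nonsensical extrapolation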

Optimal fitting is usually guaranteed

Most machine learning models are fitted with gradient descent, which involves tuning the gradient-descent algorithm and provides no guarantee that an optimal solution will be found. By contrast, linear regression that uses the sum of squares as a cost function doesn't need an iterative gradient-descent procedure at all. Instead, clever mathematics can be used to calculate the optimal location for the line directly. The mathematics are outside the scope of this module, but it's useful to know that (so long as the sample size isn't too large) linear regression doesn't need special attention to be paid to the fitting process, and the optimal solution is guaranteed.
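
A minimal sketch (assuming NumPy is available) of that direct, closed-form fit: the coefficients come from a single least-squares solve, with no learning rate and no iterations to tune:

import numpy as np

x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 3, 5, 4, 6])

# Design matrix: a column of ones for the intercept, plus the predictor.
X = np.column_stack([np.ones_like(x), x])

# Solve the least-squares problem directly; the optimum is guaranteed.
coefficients, *_ = np.linalg.lstsq(X, y, rcond=None)
intercept, slope = coefficients

print(intercept, slope)  # expected: 1.3 0.9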

Multiple Linear Regression

Definition

Multiple Linear Regression (MLR) is a statistical technique that models the relationship between a dependent variable and two or more independent variables by fitting a linear equation to observed data. The model takes the form:

\[ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_n x_n + \varepsilon \]

Where:

  • \( y \) is the dependent variable,
  • \( x_1, x_2, \ldots, x_n \) are the independent variables,
  • \( \beta_0 \) is the y-intercept,
  • \( \beta_1, \beta_2, \ldots, \beta_n \) are the coefficients of the independent variables,
  • \( \varepsilon \) is the error term.

Example: Consider predicting the price of a house based on its size (in square feet), number of bedrooms, and age. The multiple linear regression model would be:

\[ \text{Price} = \beta_0 + \beta_1 \cdot \text{Size} + \beta_2 \cdot \text{Bedrooms} + \beta_3 \cdot \text{Age} + \varepsilon \]

Calculation in R:

# Sample data
house_data <- data.frame(
  Price = c(200000, 250000, 300000, 150000, 180000),
  Size = c(1500, 2000, 2500, 1200, 1600),
  Bedrooms = c(3, 4, 4, 2, 3),
  Age = c(10, 15, 20, 5, 8)
)
 
# Multiple Linear Regression Model
model <- lm(Price ~ Size + Bedrooms + Age, data = house_data)
summary(model)
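
For readers working in Python rather than R, the same ordinary-least-squares fit can be reproduced with a direct matrix solve (a minimal sketch, assuming NumPy is available; lm() in the R code above performs the equivalent fit):

import numpy as np

# Same sample data as the R example above.
price = np.array([200000, 250000, 300000, 150000, 180000])
size = np.array([1500, 2000, 2500, 1200, 1600])
bedrooms = np.array([3, 4, 4, 2, 3])
age = np.array([10, 15, 20, 5, 8])

# Design matrix: intercept column plus the three predictors.
X = np.column_stack([np.ones_like(size), size, bedrooms, age])

# Least-squares solution, analogous to lm(Price ~ Size + Bedrooms + Age).
coefficients, *_ = np.linalg.lstsq(X, price, rcond=None)
print(coefficients)  # intercept, then coefficients for Size, Bedrooms, Age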