# Using gradient descent to estimate the parameters of a multiple linear regression model

It is often the case that when you have data, you would like to create a model of that data for predictive purposes using a multiple linear regression model. In such an endeavor, the main challenge is estimating the weights. There are many approaches for estimating the weights; in this blog, I use the method of gradient descent, with implementations in Java and Scala. The Java code will be used to show non-parallelized gradient descent, while the Scala code will be used to show parallelized gradient descent (in Spark).

A multiple linear regression model may be written as

$y = b + \mathbf{w}'\mathbf{x}$

where

• $b$ is the intercept,
• $\mathbf{w} = \begin{bmatrix} w_1 \\ w_2 \\ \vdots \\ w_n \end{bmatrix}$ is a vector of parameters (weights),
• $\mathbf{x} = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix}$ is a vector of inputs, and
• $y$ is a predicted value (scalar).
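
To make the model concrete, here is a minimal Java sketch of the prediction $y = b + \mathbf{w}'\mathbf{x}$. The `predict` method and its argument names are my own for illustration, not taken from the linked repository.

```java
public class Prediction {
    // Computes y = b + w'x: the intercept plus the dot product of w and x.
    static double predict(double b, double[] w, double[] x) {
        double y = b;
        for (int i = 0; i < w.length; i++) {
            y += w[i] * x[i];
        }
        return y;
    }

    public static void main(String[] args) {
        double[] w = {0.5, -1.0};
        double[] x = {2.0, 3.0};
        // y = 1.0 + (0.5)(2.0) + (-1.0)(3.0) = -1.0
        System.out.println(predict(1.0, w, x));
    }
}
```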

Gradient descent is an iterative method that attempts to estimate the weights through a cost function. In the case of multiple linear regression, the cost function is defined as follows.

$C(w) = \frac{1}{N} \sum_{i=1}^{N} (y_i - (\mathbf{w}'\mathbf{x_i} + b))^2$

Note that I have abbreviated the full parameter set as $w$, so read $C(w) = C(b, w_1, w_2, \ldots, w_n)$. The better the parameters, the lower the output of the cost function (e.g. zero means no error in prediction). In gradient descent, we start at a random point (e.g. random values for $b, w_1, w_2, \ldots, w_n$), evaluate $C(w)$, and walk along its surface toward the lowest output value of $C(w)$. To do this, we need to compute the gradient of the cost function. The gradient acts as a compass, telling us which direction to move to reach the lowest point; hence, gradient descent. The gradient of the cost function is computed through partial derivatives.
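
Before turning to the gradient, here is a minimal Java sketch of the cost function itself, i.e. the mean squared error over $N$ data points. The names `cost`, `X`, and `y` are illustrative, not from the repository.

```java
public class Cost {
    // C(w) = (1/N) * sum_i (y_i - (w'x_i + b))^2, the mean squared error.
    static double cost(double b, double[] w, double[][] X, double[] y) {
        double sum = 0.0;
        for (int i = 0; i < X.length; i++) {
            double yHat = b; // prediction w'x_i + b
            for (int j = 0; j < w.length; j++) {
                yHat += w[j] * X[i][j];
            }
            double error = y[i] - yHat;
            sum += error * error;
        }
        return sum / X.length;
    }

    public static void main(String[] args) {
        double[][] X = {{1.0}, {2.0}};
        double[] y = {3.0, 5.0};
        // With b = 1 and w = {2}, predictions match y exactly, so C(w) = 0.
        System.out.println(cost(1.0, new double[] {2.0}, X, y));
    }
}
```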

• $\frac{\partial C}{\partial b} = \frac{2}{N} \sum_{i=1}^{N} -(y_i - (\mathbf{w}'\mathbf{x_i} + b))$
• $\frac{\partial C}{\partial w_j} = \frac{2}{N} \sum_{i=1}^{N} -x_{i,j}(y_i - (\mathbf{w}'\mathbf{x_i} + b))$, where $x_{i,j}$ is the $j$-th component of $\mathbf{x_i}$
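
Here is a matching Java sketch that accumulates $\frac{\partial C}{\partial b}$ and each $\frac{\partial C}{\partial w_j}$ in a single pass over the data. The `gradients` method and its array layout (intercept gradient first) are hypothetical conveniences for this sketch.

```java
public class Gradients {
    // Returns {dC/db, dC/dw_1, ..., dC/dw_n} for the mean squared error cost.
    static double[] gradients(double b, double[] w, double[][] X, double[] y) {
        int n = X.length;
        double[] g = new double[w.length + 1]; // g[0] = dC/db; g[j+1] = dC/dw_j
        for (int i = 0; i < n; i++) {
            double yHat = b;
            for (int j = 0; j < w.length; j++) {
                yHat += w[j] * X[i][j];
            }
            double error = y[i] - yHat;     // y_i - (w'x_i + b)
            g[0] += (-2.0 / n) * error;     // dC/db
            for (int j = 0; j < w.length; j++) {
                g[j + 1] += (-2.0 / n) * X[i][j] * error; // dC/dw_j
            }
        }
        return g;
    }

    public static void main(String[] args) {
        double[][] X = {{1.0}, {2.0}};
        double[] y = {3.0, 5.0};
        double[] g = gradients(0.0, new double[] {0.0}, X, y);
        System.out.println("dC/db = " + g[0] + ", dC/dw_1 = " + g[1]);
    }
}
```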

The updates to the intercept and weights are as follows.

• $b = b - \alpha \frac{\partial C}{\partial b}$
• $w_j = w_j - \alpha \frac{\partial C}{\partial w_j}$

Note the $\alpha$ term; it is called the learning rate and controls how big a step we take in each iteration of gradient descent. How quickly we converge is highly sensitive to the learning rate. Also note that each parameter is updated using its own partial derivative; collected together, these partial derivatives form the gradient $\nabla C(w)$.
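
Putting the pieces together, below is a minimal, non-parallelized Java sketch of the full loop. The `fit` method, the zero initialization, and the fixed iteration count are my own simplifications for illustration (the blog describes random initialization, and the linked repository may structure things differently).

```java
public class GradientDescent {
    // Batch gradient descent for multiple linear regression: in each iteration,
    // compute dC/db and dC/dw_j over all N points, then step against the gradient.
    static double[] fit(double[][] X, double[] y, double alpha, int iterations) {
        int n = X.length;
        int p = X[0].length;
        double b = 0.0;             // intercept (zero init here for simplicity)
        double[] w = new double[p]; // weights
        for (int iter = 0; iter < iterations; iter++) {
            double gb = 0.0;
            double[] gw = new double[p];
            for (int i = 0; i < n; i++) {
                double yHat = b;
                for (int j = 0; j < p; j++) {
                    yHat += w[j] * X[i][j];
                }
                double error = y[i] - yHat;
                gb += (-2.0 / n) * error;                  // dC/db
                for (int j = 0; j < p; j++) {
                    gw[j] += (-2.0 / n) * X[i][j] * error; // dC/dw_j
                }
            }
            b -= alpha * gb;               // b = b - alpha * dC/db
            for (int j = 0; j < p; j++) {
                w[j] -= alpha * gw[j];     // w_j = w_j - alpha * dC/dw_j
            }
        }
        double[] params = new double[p + 1];
        params[0] = b;
        System.arraycopy(w, 0, params, 1, p);
        return params;
    }

    public static void main(String[] args) {
        // Data generated from y = 1 + 2x; the loop should recover b ~ 1, w_1 ~ 2.
        double[][] X = {{1.0}, {2.0}, {3.0}, {4.0}};
        double[] y = {3.0, 5.0, 7.0, 9.0};
        double[] params = fit(X, y, 0.05, 5000);
        System.out.printf("b = %.3f, w_1 = %.3f%n", params[0], params[1]);
    }
}
```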

The code that uses gradient descent to estimate the parameters of a multiple linear regression model is available at https://github.com/vangj/gradient-descent-regression. There is Java code that demonstrates this approach in a non-parallelized way, as well as Scala code that demonstrates this approach in a parallelized way on Spark. The latter assumes you have access to a Spark cluster; if you do not, you can stand up your own locally.

Wishing everyone good health. Happy reading.