Using gradient descent to estimate the parameters of a multiple linear regression model

It is often the case that when you have data, you would like to model it for predictive purposes using a multiple linear regression model. In such an endeavor, the main challenge is to estimate the weights, and there are many approaches for doing so. In this blog, I use the method of gradient descent to estimate the weights, in Java and in Scala. The Java code will be used to show non-parallelized gradient descent, while the Scala code will be used to show parallelized gradient descent (on Spark).

A multiple linear regression model may be written as

y = b + \mathbf{w}'\mathbf{x}


where

  • b is the intercept,
  • \mathbf{w} = \begin{bmatrix} w_1 \\ w_2 \\ \vdots \\ w_n \end{bmatrix} is a vector of parameters (weights),
  • \mathbf{x} = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix} is a vector of inputs, and
  • y is a predicted value (scalar).
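As a concrete illustration of the model above, the prediction y = b + \mathbf{w}'\mathbf{x} is just a dot product plus an intercept. The class and method names below are my own, a minimal sketch rather than the post's actual code:

```java
// Sketch of computing y = b + w'x; names are illustrative.
public class LinearModel {

    // Returns b + w_1*x_1 + w_2*x_2 + ... + w_n*x_n
    public static double predict(double b, double[] w, double[] x) {
        double y = b;
        for (int j = 0; j < w.length; j++) {
            y += w[j] * x[j];
        }
        return y;
    }
}
```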

Gradient descent is an iterative method that estimates the weights by minimizing a cost function. In the case of multiple linear regression, the cost function is defined as follows.

C(w) = \frac{1}{N} \sum_{i=1}^{N} (y_i - (\mathbf{w}'\mathbf{x_i} + b))^2

Note that I have abbreviated the full parameter set as w = (b, w_1, w_2, \ldots, w_n), so read C(w) = C(b, w_1, w_2, \ldots, w_n). The better the parameters, the lower the output value of the cost function (e.g. zero means no error in prediction). In gradient descent, we start at a random point (e.g. random values for b, w_1, w_2, \ldots, w_n), evaluate C(w), and walk along its surface towards its lowest output value. To do this, we need to compute the gradient of the cost function. The gradient acts as a compass, telling us which direction to move towards the lowest point, hence, gradient descent. The gradient of the cost function is computed simply through partial derivatives.
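The cost function above can be sketched directly in Java. This is my own illustrative translation of the formula, with a `predict` helper inlined so the class is self-contained:

```java
// Mean squared error cost over N observations; an illustrative sketch.
public class Cost {

    // y = b + w'x
    public static double predict(double b, double[] w, double[] x) {
        double y = b;
        for (int j = 0; j < w.length; j++) y += w[j] * x[j];
        return y;
    }

    // C(w) = (1/N) * sum_i (y_i - (w'x_i + b))^2
    public static double cost(double b, double[] w, double[][] xs, double[] ys) {
        double sum = 0.0;
        for (int i = 0; i < ys.length; i++) {
            double e = ys[i] - predict(b, w, xs[i]);
            sum += e * e;
        }
        return sum / ys.length;
    }
}
```

A perfect fit yields a cost of zero, matching the note above that zero means no error in prediction.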

  • \frac{\partial C}{\partial b} = \frac{2}{N} \sum_{i=1}^{N} -(y_i - (\mathbf{w}'\mathbf{x_i} + b))
  • \frac{\partial C}{\partial w_j} = \frac{2}{N} \sum_{i=1}^{N} -x_{ij}(y_i - (\mathbf{w}'\mathbf{x_i} + b))

Here, x_{ij} denotes the j-th input of the i-th observation, so the sample index i and the weight index j stay distinct.
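The two partial derivatives translate to Java as follows. Again, this is my own sketch, not the post's code, with `predict` repeated for self-containment:

```java
// Partial derivatives of the mean squared error cost; an illustrative sketch.
public class Gradients {

    // y = b + w'x
    public static double predict(double b, double[] w, double[] x) {
        double y = b;
        for (int j = 0; j < w.length; j++) y += w[j] * x[j];
        return y;
    }

    // dC/db = (2/N) * sum_i -(y_i - (w'x_i + b))
    public static double gradB(double b, double[] w, double[][] xs, double[] ys) {
        double sum = 0.0;
        for (int i = 0; i < ys.length; i++) {
            sum += -(ys[i] - predict(b, w, xs[i]));
        }
        return 2.0 * sum / ys.length;
    }

    // dC/dw_j = (2/N) * sum_i -x_ij * (y_i - (w'x_i + b))
    public static double gradW(int j, double b, double[] w, double[][] xs, double[] ys) {
        double sum = 0.0;
        for (int i = 0; i < ys.length; i++) {
            sum += -xs[i][j] * (ys[i] - predict(b, w, xs[i]));
        }
        return 2.0 * sum / ys.length;
    }
}
```

Note that at a perfect fit, every residual is zero and so are both partial derivatives, which is exactly the point where gradient descent stops making progress.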

The updates to the intercept and weights are as follows.

  • b = b - \alpha \frac{\partial C}{\partial b}
  • w_j = w_j - \alpha \frac{\partial C}{\partial w_j}

Note the \alpha term; it is called the learning rate and is responsible for how big of a step we take in each iteration of gradient descent. How quickly we converge depends on, and is highly sensitive to, the learning rate. Also note that each parameter is updated using the corresponding partial derivative of the cost function with respect to that parameter.
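Putting the gradients and updates together gives the full iterative loop. The following is a minimal, self-contained sketch in Java under my own naming; the post's actual code may differ:

```java
// End-to-end gradient descent for y = b + w'x; an illustrative sketch.
public class GradientDescent {

    // y = b + w'x
    public static double predict(double b, double[] w, double[] x) {
        double y = b;
        for (int j = 0; j < w.length; j++) y += w[j] * x[j];
        return y;
    }

    // Runs `iters` update steps with learning rate `alpha`.
    // Returns the estimated parameters as {b, w_1, ..., w_n}.
    public static double[] fit(double[][] xs, double[] ys, double alpha, int iters) {
        int n = xs[0].length;
        int N = ys.length;
        double b = 0.0;              // could also start from random values
        double[] w = new double[n];
        for (int t = 0; t < iters; t++) {
            double gb = 0.0;
            double[] gw = new double[n];
            for (int i = 0; i < N; i++) {
                double e = ys[i] - predict(b, w, xs[i]);
                gb += -e;                                     // accumulates dC/db (up to 2/N)
                for (int j = 0; j < n; j++) gw[j] += -xs[i][j] * e; // accumulates dC/dw_j
            }
            b -= alpha * 2.0 * gb / N;                        // b = b - alpha * dC/db
            for (int j = 0; j < n; j++) {
                w[j] -= alpha * 2.0 * gw[j] / N;              // w_j = w_j - alpha * dC/dw_j
            }
        }
        double[] out = new double[n + 1];
        out[0] = b;
        System.arraycopy(w, 0, out, 1, n);
        return out;
    }
}
```

For example, fitting data generated by y = 1 + 2x recovers an intercept near 1 and a weight near 2, provided the learning rate is small enough for the iteration to converge.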

The code that uses the gradient descent method to estimate the parameters of a multiple linear regression model is available at . There is Java code that demonstrates this approach in a non-parallelized way, as well as Scala code that demonstrates it in a parallelized way on Spark. The latter assumes you have access to a Spark cluster; if you do not, you can stand up your own locally.

Wishing everyone good health. Happy reading.