Gradient descent in $n$-dimensional space in the context of an RBF network


I am trying to implement an algorithm to perform gradient descent in an $n$-dimensional space in the context of an RBF network. My network has 5 inputs and 1 output, and its output is the following sum of Gaussian radial basis functions: $$f(x_i)=\sum_{k=1}^n w_k e^{-\lambda_k||x_i - \mu_k||^2}$$

where:

  • $n$ is the number of centroids in the network
  • $w_k$ are the weights between the hidden layer and the output layer for the $k^{th}$ centroid
  • $\lambda_k$ are the function width parameters for the $k^{th}$ centroid
  • $x_i$ is the $i^{th}$ training data input (vector)
  • $\mu_k$ is the $k^{th}$ centroid (vector)

My goal is to find optimal values of $w_k$ and $\lambda_k$ that minimize the squared error. I plan on doing that iteratively by:

  1. Fix $\lambda_k$ and use the pseudo-inverse to solve for $w_k$ (a code sketch follows this list).
  2. Fix $w_k$ and solve for $\lambda_k$ using gradient descent, minimizing the error w.r.t. $\lambda_k$.
  3. Repeat steps 1 and 2 until convergence (or some other stopping criterion) is reached.
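
Step 1 in NumPy could look roughly like this (the synthetic data and every name here, `X`, `y`, `centers`, `lam`, are just my illustration of the idea, not a definitive implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
p, n = 200, 10                      # number of training entries and centroids (illustrative)
X = rng.normal(size=(p, 5))         # 5 inputs per item, as in my network
y = rng.normal(size=p)              # placeholder target outputs
centers = rng.normal(size=(n, 5))   # centroids mu_k (in practice e.g. from k-means)
lam = np.ones(n)                    # each lambda_k fixed to 1 to start

def design_matrix(X, centers, lam):
    """Phi[i, k] = exp(-lam[k] * ||x_i - mu_k||^2)."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)  # (p, n) squared distances
    return np.exp(-d2 * lam)

# With lam fixed, f(x_i) = Phi[i] @ w is linear in w, so the least-squares
# weights come straight from the Moore-Penrose pseudo-inverse.
Phi = design_matrix(X, centers, lam)
w = np.linalg.pinv(Phi) @ y
```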

I have already found the weights by arbitrarily fixing each $\lambda_k$ to 1 and using the pseudo-inverse matrix, exactly as in the sketch above. Now I need to find values for each $\lambda_k$ that minimize the squared error. From what I understand, I need to:

  1. Find the partial derivative (w.r.t. $\lambda_k$) of the error function (the chain-rule expansion is written out after this list): $$S=\sum_{i=1}^p(y_i - f(x_i))^2$$ where $f(x_i)$ is the function above, $y_i$ is the actual output for the $i^{th}$ training entry and $p$ is the number of entries in the training set.
  2. Equate that partial derivative to zero.
  3. Solve for all the $\lambda_k$.
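
Applying the chain rule, with $d_{ik} = ||x_i - \mu_k||$, the derivative in step 1 works out to:

$$\frac{\partial S}{\partial\lambda_k}=\sum_{i=1}^{p}2\bigl(y_i-f(x_i)\bigr)\left(-\frac{\partial f(x_i)}{\partial\lambda_k}\right),\qquad\frac{\partial f(x_i)}{\partial\lambda_k}=-w_k\,d_{ik}^2\,e^{-\lambda_k d_{ik}^2},$$

so that

$$\frac{\partial S}{\partial\lambda_k}=2\,w_k\sum_{i=1}^{p}\bigl(y_i-f(x_i)\bigr)\,d_{ik}^2\,e^{-\lambda_k d_{ik}^2}.$$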

I do not fully understand how to perform the gradient descent w.r.t. $\lambda_k$.

Can anyone help me clarify how to perform gradient descent in an $n$-dimensional space? I need to find an update $\delta_k$ for each $\lambda_k$, so that $\lambda_k := \lambda_k + \delta_k$.

My current algorithm for one pass of gradient descent over the training set:

for each item x_i in the training data set
   feed x_i forward through the network to get f(x_i)
   for each centroid k
      λ_k := λ_k - η * (gradient of the cost for this item w.r.t. λ_k)    // η is the learning-rate constant
   end loop
end loop

where:

$\frac{\partial S}{\partial\lambda_k} = \sum_{i=1}^p 2\,w_k\,d_{ik}^2\,(y_i - f(x_i))\,e^{-\lambda_k d_{ik}^2}$

where $d_{ik} = ||x_i - \mu_k||$ is the Euclidean distance between the $i^{th}$ input and the $k^{th}$ centroid, and $y_i$ is the target output from the $i^{th}$ training entry; both are known values.
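
As a full-batch version in NumPy (reusing the `X`, `y`, `centers`, `lam`, `w` arrays from the earlier sketch; `eta`, the step count and the function names are placeholders I chose, not fixed parts of the method):

```python
import numpy as np

def grad_S_lambda(X, y, centers, lam, w):
    """Batch gradient dS/dlambda_k of S = sum_i (y_i - f(x_i))^2."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)  # (p, n): d_ik^2
    Phi = np.exp(-d2 * lam)                                        # (p, n): exp(-lambda_k d_ik^2)
    r = y - Phi @ w                                                # residuals y_i - f(x_i)
    # dS/dlambda_k = 2 w_k * sum_i (y_i - f(x_i)) * d_ik^2 * exp(-lambda_k d_ik^2)
    return 2.0 * w * (r[:, None] * d2 * Phi).sum(axis=0)

def descend_lambdas(X, y, centers, lam, w, eta=1e-3, steps=100):
    """Step 2: gradient-descent updates on lambda with w held fixed."""
    for _ in range(steps):
        lam = lam - eta * grad_S_lambda(X, y, centers, lam, w)
        lam = np.maximum(lam, 1e-8)  # optional safeguard: keep each lambda_k positive
    return lam
```

Alternating `w = np.linalg.pinv(design_matrix(X, centers, lam)) @ y` (step 1) with `lam = descend_lambdas(X, y, centers, lam, w)` (step 2) would then implement the loop above.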

Thank you very much!