Why does my regression-NN completely fail to predict some points?


I would like to train a NN to approximate an unknown function $y = f(x_1,x_2)$. I have many measurements $y = [y_1,\dots,y_K]$ (with $K$ in the range of 10–100 thousand) coming from either a simulation or a measurement of a system. I've built a feed-forward NN for this problem using an MSE loss function, i.e.

$$\mathcal{L} = \frac{1}{K}\sum_{i=1}^K(y_i-\hat{y}_i)^2$$

where $\hat{y}$ denotes the prediction of the NN. I used ReLU activations. The network topology is fairly simple: an input layer with two neurons $(x_1,x_2)$, three hidden layers with 10 neurons each, and finally a single-neuron output layer.

After training finishes I obtain a very curious result. The loss function takes very small values, hence (apparently) indicating that training is successful. However, if I analyse the squared error point by point, i.e. the quantity

$$ \boldsymbol{\epsilon}_y = [(y_1-\hat{y}_1)^2,\dots,(y_K-\hat{y}_K)^2] $$ I find that for the vast majority of points this quantity is basically zero, with the exception of some "outliers" where the error is huge. It looks like this happens where the gradient of $f$ is rather large, which seems plausible.

I would like this to stop happening. I'd rather accept a slightly bigger error throughout the whole function domain than have the majority of points with null error but some local points that are completely off. As a requirement, the network topology shall be kept rather "easy", so I would not like to increase the number of layers and/or neurons per layer. As a side note, I've also tried increasing the topology complexity a little (i.e. 15 neurons per hidden layer plus a 4th hidden layer), obtaining slightly better results but still an unacceptable error around the steepest points of the function.

I've got two ideas for now:

  1. Use a different loss-function $\mathcal{L}$
  2. Sample the dataset $(x_1,x_2)$ more frequently around steepest regions, and less frequently where the function is rather smooth and flat

First option

A different loss function could be adopted. I'm not very familiar with loss functions that might help solve my problem; some rather quick research yielded no good results, and I found no significant literature highlighting this kind of problem. I initially thought of something along the lines of

$$ \mathcal{L} = \frac{1}{K}\sum_{i=1}^K(y_i-\hat{y}_i)^2 + \mu \max\{\boldsymbol{\epsilon}_y\} $$ where $\mu$ is some tuned parameter that penalizes the maximum error. I'm not sure whether this makes sense, or whether it increases the training complexity to the point that the training process no longer converges.
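For what it's worth, the combined loss above is straightforward to write down; here is a minimal numpy sketch (the function name and the default value of $\mu$ are my own choices, and in a real training loop the max would be taken per mini-batch):

```python
import numpy as np

def mse_plus_max_loss(y, y_hat, mu=0.1):
    """MSE plus a penalty on the worst per-sample squared error.

    mu is a hypothetical tuning knob: larger values push the
    optimiser harder on the single worst-predicted point.
    """
    sq_err = (y - y_hat) ** 2            # per-point squared errors eps_y
    return sq_err.mean() + mu * sq_err.max()
```

A smoother alternative with the same flavour is a high-order mean, e.g. `np.mean(sq_err**4) ** 0.25`, which approximates the max while keeping gradients spread over several points instead of just one.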

Second option

I am not sure how to formalize undersampling where the function is smooth and flat. I imagine that some kernel used in image processing (edge-detection kernels of some sort?) might be helpful, but I'm in completely unknown territory here.
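One concrete way to formalise this idea, sketched under my own assumptions (1-D input for readability, function and parameter names hypothetical): estimate the local slope of $f$ by finite differences on the existing samples, then resample training points with probability proportional to that slope, so steep regions are seen more often.

```python
import numpy as np

def gradient_weighted_resample(x, y, n_draw, eps=1e-3, rng=None):
    """Resample a 1-D dataset with probability ~ local |dy/dx|.

    x, y   : sorted 1-D sample arrays of the function y = f(x).
    n_draw : how many indices to draw (with replacement).
    eps    : floor so flat regions keep a nonzero chance.
    """
    rng = np.random.default_rng(rng)
    slope = np.abs(np.gradient(y, x))    # finite-difference estimate of |f'(x)|
    weights = slope + eps
    p = weights / weights.sum()
    return rng.choice(len(x), size=n_draw, p=p)
```

For the actual 2-D input $(x_1,x_2)$ one would instead weight each point by the norm of a local gradient estimate on a grid; an image-processing edge detector (e.g. a Sobel filter applied to a gridded version of $y$) computes essentially the same quantity, which is the connection to the kernels mentioned above.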

Conclusion

I'm looking for insight and ideas (links to literature results or analogous cases are a huge plus!) for solving this curious problem.

Thanks for the help!

1 Answer

I've had this problem before when predicting time-domain signals. The objective was to produce an output time-domain signal as close as possible to the target. The initial training with MSE loss produced an output signal with many perturbations in it.

One approach worth trying, to reduce perturbations in certain output errors, is:

$$\mathcal{L} = \left|\int_{n_0}^{n_N} F(\hat{y})\,dn - \int_{n_0}^{n_N} F(y)\,dn\right|$$

where $F$ is the signal reconstructed from $\hat{y}$ and $y$ respectively, integrated over the $N$ points $n_0,\dots,n_N$, and $y \in \mathbb{R}^D$. Some scaling can be applied afterwards.

This way, the network will minimise the overall difference between $\hat{y}$ and $y$. A linear combination of this loss and the MSE loss can also be used.
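That linear combination can be sketched in numpy as follows (a discrete trapezoidal sum stands in for the integral, and `alpha` is my own name for the mixing weight):

```python
import numpy as np

def combined_loss(y, y_hat, n, alpha=0.5):
    """alpha * MSE + (1 - alpha) * |integral difference|.

    n     : sample locations n_0..n_N of the time-domain signal.
    alpha : hypothetical mixing weight between the two terms.
    """
    mse = np.mean((y - y_hat) ** 2)
    # trapezoidal approximation of the integral of each signal over n
    integral = lambda f: np.sum(0.5 * (f[1:] + f[:-1]) * np.diff(n))
    integral_gap = abs(integral(y_hat) - integral(y))
    return alpha * mse + (1 - alpha) * integral_gap
```

Note that the integral term alone can be zero even when the signals differ (positive and negative errors cancel), which is why mixing it with the pointwise MSE term makes sense.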

Ultimately, the network will train for the provided loss function and minimise that loss function; that said, it will also depend on the data and how the network is able to converge for that given dataset.

If the data is extremely nonlinear, increasing the network size might help to reduce the perturbations.