Typically we use activation functions to get, e.g., a probability distribution from softmax. However, is there a way to control the values of the weighted sum of inputs even before applying the activation function, such as through clever ways of formulating the loss function? Thank you!
2026-03-27 14:59:12
How can I control the 'range' of weighted sum of inputs from a neural network?
424 Views · Asked by Bumbble Comm (https://math.techqa.club/user/bumbble-comm/detail)
Yes! There are several strategies you can use to normalize the input before it is sent to a softmax function. This is often desirable because the exponentials in softmax grow quickly, even for relatively small inputs. However, there is no way to truly guarantee a specific range short of normalizing the values at every layer. What you can do is keep the pre-activations small while keeping them accurate, which is one of the main battles of correctly fitting a neural network.
There are two main approaches you can take to do this. I believe the former is closer to what you're looking for.
1. Input Control
Controlling the input to a network involves two things: controlling the values of the inputs themselves, and controlling the weights that transform those inputs.
a) Normalize Your Input
You will want to apply some form of normalization to your initial input to scale your feature vectors appropriately. A standard approach is to scale each input feature to have mean 0 and variance 1. Linear decorrelation (whitening/PCA) also helps a lot.
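As a sketch of that standardization step (the feature matrix here is made up for illustration):

```python
import numpy as np

# Hypothetical feature matrix: 5 samples, 3 features on very different scales.
X = np.array([[1.0, 200.0, 0.001],
              [2.0, 180.0, 0.002],
              [3.0, 220.0, 0.003],
              [4.0, 210.0, 0.004],
              [5.0, 190.0, 0.005]])

# Standardize each feature (column) to mean 0 and variance 1.
mean = X.mean(axis=0)
std = X.std(axis=0)
X_standardized = (X - mean) / std

print(np.allclose(X_standardized.mean(axis=0), 0.0))  # True
print(np.allclose(X_standardized.var(axis=0), 1.0))   # True
```

Remember to store `mean` and `std` from the training set and reuse them to transform validation/test data, rather than recomputing them per split.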
b) Use a regularization term in your loss function.
By adding a regularization term to your loss function, you let backpropagation also train the network to keep its weights small, which in turn controls the values going into the activation functions. There are two main regularization terms for the kernel (weights).
$l_1$ regularization:
$new\_loss\_f = old\_loss\_f + \lambda \lVert K \rVert_1$
and $l_2$ regularization:
$new\_loss\_f = old\_loss\_f + \frac{\lambda}{2}\lVert K \rVert_2^2$
where $K$ denotes the kernel (weights) and $\lambda > 0$ controls the regularization strength.
For example, cross entropy loss with l2 regularization becomes:
$$\mathcal{L}(X, Y) = -\frac{1}{n} \sum_{i=1}^n \left[ y^{(i)} \ln a(x^{(i)}) + \left(1 - y^{(i)}\right) \ln \left(1 - a(x^{(i)})\right) \right] + \frac{\lambda}{2}\lVert K \rVert_2^2$$
Thus, your loss function accounts for the size of the kernel, or weights, and penalizes them for being large. During optimization (backpropagation), the weights will be adjusted to mitigate this additional term.
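A minimal sketch of that effect, using a toy logistic-regression "layer" on synthetic data (the data, learning rate, and $\lambda$ value are all made up for illustration). The gradient of the $l_2$ penalty is simply $\lambda K$, which shrinks the weights at every step:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data and a single weight vector K (the "kernel").
X = rng.normal(size=(100, 3))
y = (X @ np.array([1.0, -2.0, 0.5]) > 0).astype(float)
K = rng.normal(size=3)
lam = 0.1  # regularization strength lambda (a hyperparameter you choose)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(K):
    a = sigmoid(X @ K)
    eps = 1e-12  # avoid log(0)
    cross_entropy = -np.mean(y * np.log(a + eps) + (1 - y) * np.log(1 - a + eps))
    l2_penalty = 0.5 * lam * np.sum(K ** 2)  # (lambda / 2) * ||K||^2
    return cross_entropy + l2_penalty

initial_loss = loss(K)

# Gradient descent: the penalty contributes lam * K to the gradient,
# pulling the weights toward zero on every update.
for _ in range(500):
    a = sigmoid(X @ K)
    grad = X.T @ (a - y) / len(y) + lam * K
    K -= 0.5 * grad

print(loss(K) < initial_loss)  # True: training reduced the regularized loss
```

In most frameworks you get this for free, e.g. via a `weight_decay` argument on the optimizer or a kernel regularizer on the layer, rather than writing the penalty by hand.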
2. Batch Normalization
Another option is batch normalization, which normalizes the inputs to the hidden layers over each mini-batch. It is mostly used to speed up training, but it also gives you direct control over the distribution of pre-activations at each hidden layer.
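A sketch of the batch-norm forward pass on one layer's pre-activations (the function name, batch size, and statistics here are illustrative; real implementations also track running statistics for inference and learn `gamma`/`beta` by backprop):

```python
import numpy as np

def batch_norm_forward(z, gamma, beta, eps=1e-5):
    """Normalize pre-activations z over the batch dimension (axis 0),
    then rescale/shift with the learnable parameters gamma and beta."""
    mu = z.mean(axis=0)
    var = z.var(axis=0)
    z_hat = (z - mu) / np.sqrt(var + eps)
    return gamma * z_hat + beta

rng = np.random.default_rng(1)
# Pre-activations with a wild scale: mean ~50, std ~10, batch of 32, 4 units.
z = rng.normal(loc=50.0, scale=10.0, size=(32, 4))
out = batch_norm_forward(z, gamma=np.ones(4), beta=np.zeros(4))

print(out.mean(axis=0))  # approximately 0 for every unit
print(out.std(axis=0))   # approximately 1 for every unit
```

With `gamma=1` and `beta=0` the layer outputs have mean 0 and variance 1 regardless of the incoming scale, which is exactly the kind of range control the question asks about, applied inside the network.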