What is special about $(1-\alpha )\cdot f(x) + \alpha \cdot f(y)$?


I see the expression $(1-\alpha )\cdot f(x) + \alpha \cdot f(y)$ in many places:

In the definition of a concave function:

$$ {\displaystyle f((1-\alpha )x+\alpha y)\geq (1-\alpha )f(x)+\alpha f(y)}$$

In reinforcement learning:

*[image of the update rule, not recoverable]*

which seems to also have the special notation:

*[image of the notation, not recoverable]*

as well as in batch normalization for neural networks:

*[image of the batch-normalization update, not recoverable]*

I've seen it appear in other places as well, and I was wondering whether there is a special meaning behind it: an intuitive way of looking at it, or a new angle that gives some insight.


2 Answers

Best answer

It's a cheap way of forming a (weighted) convex combination of two values (writing $\lambda$ for your $\alpha$). Generally speaking, the freedom to choose $\lambda$ lets you bias the result toward either $f(x)$ or $f(y)$. When $\lambda = 1/2$ you get the usual average. When $\lambda$ is close to $0$ or $1$, you strongly prefer one point over the other. So it's commonly used in situations where you want to balance the contributions of two terms and prefer to land somewhere in between them, as opposed to outside their interval. In a sense, you're specifying how much you trust one point over the other.
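To see the biasing effect concretely, here is a minimal sketch in plain Python (the `lerp` name is my own; the point is just that the result never leaves the interval between the two inputs):

```python
def lerp(a, b, lam):
    """Convex combination of a and b: lam=0 gives a, lam=1 gives b."""
    return (1 - lam) * a + lam * b

# The result always lies between a and b.
print(lerp(10.0, 20.0, 0.5))   # midpoint
print(lerp(10.0, 20.0, 0.9))   # strongly biased toward b
```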

For example, this is useful in machine learning and statistics, because it allows you to play with different values of $\lambda$ to see which model performs better.

As an example, elastic-net regularization can simply be of the form $\lambda \|w\|_1+(1-\lambda)\|w\|_2^2$, which tries to balance the ($L_1,L_2$) regularization terms.
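As a quick illustration, here is a sketch of that penalty in plain Python (the `elastic_net_penalty` name is mine; note that some formulations also put a factor of $1/2$ on the ridge term):

```python
def elastic_net_penalty(w, lam):
    """Convex combination of the L1 penalty and the squared L2 penalty.

    lam=1 recovers a pure lasso penalty, lam=0 a pure ridge penalty;
    values in between trade the two off -- the (1-a)*x + a*y pattern.
    """
    l1 = sum(abs(wi) for wi in w)
    l2 = sum(wi * wi for wi in w)   # squared L2 norm, as in ridge
    return lam * l1 + (1 - lam) * l2

w = [1.0, -2.0, 0.0]
print(elastic_net_penalty(w, 1.0))  # pure L1 penalty
print(elastic_net_penalty(w, 0.0))  # pure squared L2 penalty
```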

It's also common when moving averages are involved, which pertains to your reinforcement learning example, or, say, the momentum update in the Adam optimizer: $m_t=\lambda m_{t-1}+(1-\lambda)g_t$, where $m$ is the moving average of the gradients and $g$ is the gradient from the current batch.
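The smoothing behaviour of that update can be sketched in a few lines of plain Python (the `ema` helper name is mine): each step is a convex combination of the running average and the new observation, so outliers get damped rather than adopted wholesale.

```python
def ema(values, lam):
    """Exponential moving average over a sequence.

    Each step is a convex combination of the previous average and the
    new value, the same form as Adam's first-moment update.
    """
    m = values[0]
    out = [m]
    for g in values[1:]:
        m = lam * m + (1 - lam) * g
        out.append(m)
    return out

series = [1.0, 1.0, 10.0, 1.0, 1.0]
print(ema(series, 0.9))  # the spike at 10.0 is heavily damped
```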

In probability, it's the easiest way of making a new distribution (called a mixture distribution) from distributions $P,Q$, via $R(x):=\lambda P(x)+(1-\lambda)Q(x)$.
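One way to sample from such a mixture is to first pick a component with probability $\lambda$ and then sample from it. A small sketch in plain Python, with hypothetical helper names:

```python
import random

def sample_mixture(sample_p, sample_q, lam, n=10_000, seed=0):
    """Draw n samples from R = lam*P + (1-lam)*Q by flipping a
    lam-biased coin for the component, then sampling it."""
    rng = random.Random(seed)
    return [sample_p(rng) if rng.random() < lam else sample_q(rng)
            for _ in range(n)]

# Toy mixture: P is N(0,1), Q is N(5,1), mixed 30/70.
draws = sample_mixture(lambda r: r.gauss(0, 1),
                       lambda r: r.gauss(5, 1),
                       lam=0.3)
mean = sum(draws) / len(draws)
print(round(mean, 2))  # near the mixture mean 0.3*0 + 0.7*5 = 3.5
```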


This stems from a simple observation: if I am on one side of a boundary and do not want to cross it, I should stick to numerical methods that encode that intention rigidly and robustly. Learning is an approximation of target functions, ideally by methods that are both robust and fast.

The convex formula has exactly these desirable attributes. In practice it is limited mainly by whether the problem allows $f$ to be formulated in a tractable form. Closed-form expressions inspire more trust; having to invert the function at every evaluation costs time and memory, and with them some of that trust.

I agree that convex optimization, and reinforcement learning based on such modelling, appears very often, if not nearly all the time. That has to do with the need to model in realistic, well-understood spaces, and with the fact that newer methods team up with older, better-known ones to gain trust and new applications. We live in a roughly spherical world, so this is a very deep pattern in human thinking. Many new scientific fields are at first explored by searching for convex situations; think of black-hole science, where both the theory and the optical methodology are essentially convex.

So convexity is associated with elliptical solutions and closed bounds, and concavity with open bounds; hyperbolic cases show both situations with open bounds, and only the linear world has both. This static judgement can be transferred to dynamic situations. Learning in computer science is a methodology that starts close to statics and works toward a dynamic situation, for example in game theory. But finite-state chains are usually present, and so such expressions are popular.

They contrast with the classic convex-function formula: the use of the term momentum signals that the methodology is second order, and that the function representations are second order. While the convex formula interpolates between two values, the formulas from reinforcement learning and other learning methodologies extrapolate.

That is a harder situation, only reachable through expectations taken over values in a computing process. The methods found popular in reinforcement learning all use momentum as a standardized term: Adam, RMSProp, SGD and SignSGD. The basis is stochastic gradient descent, extended to gain an adaptive learning rate. There is more need for that adaptiveness than for maintaining convexity, as in Adam's invariance to a diagonal rescaling of the gradients.

The gradients dominate, and taking momentum into account makes the methods adaptive, as a kind of luxury predicate. Only SignSGD discards the gradient magnitudes, keeping just their signs.
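To make that contrast concrete, here is a minimal sketch of one plain-SGD step versus one SignSGD step (helper names are mine; momentum and the distributed voting variant of SignSGD are omitted):

```python
def sgd_step(w, grad, lr):
    """Plain SGD: the step size scales with the gradient's magnitude."""
    return [wi - lr * gi for wi, gi in zip(w, grad)]

def sign_sgd_step(w, grad, lr):
    """SignSGD: keep only the sign of each coordinate, so every
    coordinate moves by exactly lr, regardless of gradient size."""
    sign = lambda g: (g > 0) - (g < 0)
    return [wi - lr * sign(gi) for wi, gi in zip(w, grad)]

w, grad = [0.0, 0.0], [100.0, -0.001]
print(sgd_step(w, grad, 0.1))       # dominated by the large coordinate
print(sign_sgd_step(w, grad, 0.1))  # both coordinates move by 0.1
```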

It often represents thoughts about locality and globality in the considered regions of trust, learning, or objective functions. Learning in machine learning often uses standardized training sets for adaptation and for the judgement: am I on the right track? This uncertainty on the left-hand side of your symbolic formulas is unavoidable.

What matters are the performance goals, and therefore the memory needed for training and evaluation, and the speed, measured in computation time for training and evaluation. These combined goals make reinforcement learning attractive and keep everything on sure ground.