What is the meaning of the word "block" in the Block Successive Minimization Methods (BSUM) for optimization?



https://arxiv.org/abs/1511.02746

My guess is that the word "block" means the act of updating both sub-problems in parallel at every iteration, but I am not sure whether that is correct.

$$\min_{x_1,x_2} f(x_1,x_2) \;\to\; \begin{cases} \min_{x_1} f(x_1,x_2) \\ \min_{x_2} f(x_1,x_2) \end{cases}$$

Best answer:

You got it right. Block coordinate descent algorithms are a class of algorithms that do not attempt to minimize the objective function with respect to all variables at once, but instead run through the variables separately at each "sub-iteration". Generally speaking, their advantage is the simplicity of the sub-problems, which translates to low computational cost, so they can be suitable for huge-scale problems and parallel computing. Their disadvantages can be a slow convergence rate and convergence only to a stationary point rather than a global minimum.
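To make the "one block per sub-iteration" idea concrete, here is a minimal sketch of cyclic block coordinate descent with single-variable blocks. The objective below is an illustrative strictly convex quadratic of my own choosing (not from the linked paper); each sub-iteration minimizes it exactly over one variable, obtained by setting the corresponding partial derivative to zero.

```python
# Cyclic block coordinate descent sketch with single-variable blocks on the
# strictly convex quadratic (illustrative example, not from the paper):
#   f(x1, x2) = x1^2 + x2^2 + x1*x2 - x1 - x2

def f(x1, x2):
    return x1**2 + x2**2 + x1*x2 - x1 - x2

def bcd(x1=0.0, x2=0.0, iters=50):
    for _ in range(iters):
        # Each sub-iteration solves one block exactly while the other is fixed:
        x1 = (1.0 - x2) / 2.0   # argmin over x1 (from d f / d x1 = 0)
        x2 = (1.0 - x1) / 2.0   # argmin over x2 (from d f / d x2 = 0)
    return x1, x2

x1, x2 = bcd()
# For this strictly convex f, the iterates converge to the unique
# global minimizer (1/3, 1/3).
```

Note that each sub-problem is a one-dimensional quadratic with a closed-form solution, which is exactly the "simple sub-problems, low cost per step" trade-off described above.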

Notice that a "block" does not have to mean a single variable. Blocks can contain anywhere from $1$ to $n$ variables, and all variables within a block are updated simultaneously. The specific choice of block partition depends on the objective, the constraint set, the minimization procedure, etc. In your example you used single-variable blocks. However, consider the function: $$f(x_1,x_2,x_3,x_4)=x_1^2x_3^2+x_2^2x_4^2$$

where it makes sense to partition into the blocks $b_1=(x_1,x_2),\,b_2=(x_3,x_4)$: since there is no coupling between the variables within each block, the function $f$ is convex with respect to $b_1$ while $b_2$ is held fixed (and vice versa).
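This block-wise convexity can be checked numerically with a quick sketch. The check below uses the midpoint inequality: a convex function can never take a larger value at the midpoint of two points than at both endpoints, yet $f$ does exactly that in the joint variables, while the restriction to $b_1$ (with $b_2$ fixed) behaves convexly.

```python
# Numerical sketch: f(x1,x2,x3,x4) = x1^2*x3^2 + x2^2*x4^2 is convex in the
# block b1 = (x1, x2) for fixed b2 = (x3, x4), but NOT jointly convex.

def f(x1, x2, x3, x4):
    return x1**2 * x3**2 + x2**2 * x4**2

# Joint non-convexity: the midpoint value exceeds BOTH endpoint values,
# which is impossible for a convex function.
p = (2.0, 0.0, 0.0, 0.0)
q = (0.0, 0.0, 2.0, 0.0)
mid = tuple((a + b) / 2 for a, b in zip(p, q))
print(f(*p), f(*q), f(*mid))   # 0.0 0.0 1.0 -> convexity violated

# Block convexity: with b2 = (1, 1) fixed, f reduces to x1^2 + x2^2,
# which satisfies the midpoint inequality for every pair of b1 points.
g = lambda x1, x2: f(x1, x2, 1.0, 1.0)
```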

Notice that we do not always parallelize; it is an option, not a must. BSUM in particular updates the blocks sequentially, not in parallel. Also, the "sub-iterations" do not have to be executed in a cyclic manner; there are multiple options, from cyclic, to random draws from various distributions, to a pre-defined permutation. A nice property of this class of algorithms is that the choice of update rule usually does not affect the convergence analysis, but it is required that each block is updated exactly once during a "meta-iteration". Moreover, there is empirical evidence that random permutations converge faster in practice than fixed/cyclic ones, even when the theoretical convergence guarantees are identical.