This question is motivated by the paper by Cai (2016) on the joint estimation of multiple ($K$) precision matrices from $K$ datasets.
Let $X^{(k)} \sim N(\mu^{(k)}, \Sigma^{(k)})$ be a $p$-dimensional random vector for the $k$th group. The precision matrix of $X^{(k)}$, denoted by $\Omega^{(k)} = (\omega_{ij}^{(k)})$, is the inverse of the covariance matrix $\Sigma^{(k)}$. Assume that the $X^{(k)}$'s are independent of each other. Suppose there are $n_k$ independent and identically distributed samples from $X^{(k)}$: $\{X_j^{(k)}, 1 \leq j \leq n_k\}$, and let $n = n_1 + \cdots + n_K$. The sample covariance matrix for each group is denoted by $\hat{\Sigma}^{(k)}$.
The goal is to simultaneously estimate the precision matrices $\Omega^{(k)}$ for $1 \leq k \leq K$. The following optimization problem is proposed:
$$
\min_{\Omega^{(1)}, \ldots, \Omega^{(K)}} \; \sum_{k=1}^{K} w_k \|\Omega^{(k)}\|_1 \quad \text{subject to} \quad \max_{1 \leq i, j \leq p} \left( \sum_{k=1}^{K} w_k \left[ \big(\hat{\Sigma}^{(k)} \Omega^{(k)} - I\big)_{ij} \right]^2 \right)^{1/2} \leq \lambda_n,
$$
where $w_k = n_k/n$ is the weight for the kth group and $\lambda_n$ is a tuning parameter.
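To make the constrained quantity concrete, here is a small NumPy sketch (my own illustration, not code from the paper; the simulated data and the identity candidates $\Omega^{(k)} = I$ are assumptions purely for demonstration). For each entry $(i,j)$, it pools the $K$ residuals $\hat{\Sigma}^{(k)} \Omega^{(k)} - I$ into a single weighted $\ell_2$ norm, and then takes the max over all $p^2$ entries:

```python
import numpy as np

rng = np.random.default_rng(0)
p, K = 5, 3
n = np.array([40, 60, 100])   # group sample sizes n_k (assumed values)
w = n / n.sum()               # weights w_k = n_k / n

# Simulate K groups and form the sample covariances Sigma_hat^(k).
# N(0, I) data is used here only for illustration.
Sigma_hat = [np.cov(rng.standard_normal((n[k], p)), rowvar=False)
             for k in range(K)]

# Candidate precision estimates (identity matrices, purely illustrative)
Omega = [np.eye(p) for _ in range(K)]

# Residuals Sigma_hat^(k) Omega^(k) - I, stacked into shape (K, p, p)
R = np.stack([Sigma_hat[k] @ Omega[k] - np.eye(p) for k in range(K)])

# Element-wise group l2 norm: for each entry (i, j), pool the K
# residuals with weights w_k -> a single (p, p) matrix of norms
group_l2 = np.sqrt(np.einsum('k,kij->ij', w, R**2))

# The constraint requires the *largest* of these p*p pooled norms
# to be at most lambda_n
print(group_l2.max())
```

Note that for each fixed $(i,j)$ the $K$ residuals are collapsed into a single number before the max is taken, so the constraint couples the groups entry by entry rather than holding separately for each $k$.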
It is stated that the "objective function is used to encourage the sparsity of all K precision matrices", which makes sense to me: an $\ell_1$ penalty is imposed on each of the $K$ precision matrices, and this penalty drives entries to zero. However, I do not understand the intuition behind the next statement: "The constraint is imposed on the maximum of the element-wise group l2 norm to encourage the groups to share a common sparsity pattern". In particular, how does this constraint encourage a common sparsity pattern? What is the role of taking the max over the $(i,j)$ entries?
