Understanding the math in the AdaNet paper


I am reading up on AdaNet from here.

In the network architecture section the author defines it as follows:

Let $l$ denote the number of intermediate layers in the network and $n_k$ the maximum number of units in layer $k \in [l]$. Each unit $j \in [n_k]$ in layer $k$ represents a function denoted by $h_{k,j}$ (before composition with an activation function). Let $X$ denote the input space and, for any $x \in X$, let $\Psi(x) \in \mathbb{R}^{n_0}$ denote the corresponding feature vector. Then the family of functions defined by the first-layer functions $h_{1,j}, j \in [n_1]$, is the following:

$$H_1 = \left\{x \mapsto {\bf{u}} \cdot \Psi(x): {\bf{u}} \in \mathbb{R}^{n_{0}},\ {||{{\bf{u}}}||}_p \leq \Lambda_{1,0}\right\} \tag1$$

Hence, $H_1$ is the set of all norm-bounded linear mappings of the input feature vector.
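
For concreteness, here is a minimal numpy sketch of one first-layer unit, assuming $\Psi$ is the identity map and $p = 2$ (both assumptions on my part, with made-up dimensions):

```python
import numpy as np

n0, Lambda_10, p = 4, 1.0, 2          # feature dim, norm bound, l_p norm (assumed values)
rng = np.random.default_rng(0)

u = rng.normal(size=n0)
norm = np.linalg.norm(u, ord=p)
if norm > Lambda_10:                  # enforce the constraint ||u||_p <= Lambda_{1,0}
    u *= Lambda_10 / norm

def h_1(x):
    """One unit of H_1: a norm-bounded linear map of the feature vector Psi(x) = x."""
    return u @ x

print(h_1(rng.normal(size=n0)))
```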

Where $p \geq 1$ defines an $l_p$ norm and $\Lambda_{1,0} \geq 0$ is a hyperparameter bounding the norm of the weights connecting layer 0 to layer 1. The family of functions $h_{k,j}, j\in [n_k]$, in a higher layer $k > 1$ is then defined as follows:

$$H_k = \left\{x \mapsto \displaystyle\sum_{s=1}^{k-1}{\bf{u}_s} \cdot (\psi_s \circ {\bf{h}_s})(x): {\bf{u}}_s \in \mathbb{R}^{n_{s}},\ {||{{\bf{u}}_s}||}_p \leq \Lambda_{k,s},\ {\bf{h}}_s \in H_s^{n_s} \right\}\tag2$$

$\circ$ denotes element-wise composition with a non-linearity (ReLU, sigmoid, etc.). $\Lambda_{k,s}$ controls how sparse the network is; it is essentially a regularization parameter.

Hence, $H_k$ is the set of all norm-bounded linear mappings of the feature maps (output vectors) produced by all previous layers.

This basically defines an architecture where $H_k$ is the $k^{\text{th}}$ layer and is connected to every layer that comes before it, as in the sketch below.
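
A minimal numpy sketch of this connectivity, under my own assumptions ($\Psi$ is the identity, every $\psi_s$ is ReLU, made-up widths, and the norm constraints $\|{\bf u}_s\|_p \leq \Lambda_{k,s}$ omitted for brevity):

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):                                     # stand-in for the element-wise psi_s
    return np.maximum(z, 0.0)

n = [4, 3, 3, 2]                                 # n_0 (feature dim) and widths n_1..n_3
l = len(n) - 1

# U[(k, s)] holds the weights connecting layer s to layer k (s < k).
U = {(1, 0): rng.normal(size=(n[1], n[0]))}      # layer 1 reads the raw features, eq. (1)
for k in range(2, l + 1):
    for s in range(1, k):                        # layer k reads every layer 1..k-1, eq. (2)
        U[(k, s)] = rng.normal(size=(n[k], n[s]))

def forward(x):
    h = [x]                                      # h[0] = Psi(x), taken to be the identity
    h.append(U[(1, 0)] @ h[0])                   # h_1: linear map of the features
    for k in range(2, l + 1):
        # eq. (2): sum over s = 1..k-1 of u_s . (psi_s o h_s)(x)
        h.append(sum(U[(k, s)] @ relu(h[s]) for s in range(1, k)))
    return h[1:]                                 # the output vectors h_1, ..., h_l

hs = forward(rng.normal(size=n[0]))
print([v.shape for v in hs])
```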

Now the final definition. The output unit of the network is defined as the function $$\displaystyle{\sum_{k=1}^{l}\sum_{j=1}^{n_k}}w_{k,j}h_{k,j} = \sum_{k=1}^{l}{\bf{w}}_k \cdot {\bf{h}}_k \tag 3$$ where $\bf{h_k}$ is the output vector of the $k^{\text{th}}$ layer and ${\bf{w_k}}\in\mathbb{R}^{n_k}$ is the vector of weights connecting the output unit to the units of layer $k$.
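
In code, equation (3) is just a weighted sum over all units in all layers. A self-contained sketch, with stand-in values for the layer outputs (the widths and vectors here are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(1)
widths = [3, 3, 2]                               # n_1..n_l (made-up widths)
h = [rng.normal(size=nk) for nk in widths]       # stand-ins for the layer outputs h_k(x)
w = [rng.normal(size=nk) for nk in widths]       # output weights w_k in R^{n_k}

# Equation (3): sum_k sum_j w_{k,j} h_{k,j} = sum_k w_k . h_k
f_x = sum(w_k @ h_k for w_k, h_k in zip(w, h))
print(f_x)
```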

Then the author goes on to define $\mathcal{F}$, the family of functions given by equation (3) with the absolute values of the weights summing to one: $$\mathcal{F} = \left\{\displaystyle\sum_{k=1}^{l}{\bf{w_k}} \cdot {\bf{h_k}}: {\bf{h}}_k \in {H_{k}^{{n}_{k}}},\ \displaystyle \sum_{k=1}^l\|{\bf{w}}_k\|_1 = 1 \right\}$$

Let $\widetilde{H}_k$ denote the union of $H_k$ and its reflection, $\widetilde{H}_k = H_k \cup (-H_k)$, and let $H$ denote the union of the families $\widetilde{H}_k$: $H = \bigcup_{k=1}^l\widetilde{H}_k$. Then $\mathcal{F}$ coincides with the convex hull of $H$: $\mathcal{F} = \operatorname{conv}(H)$.
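
Writing the claim out explicitly (this expansion is my own attempt, so it may be where I go wrong): for any $f = \sum_{k=1}^{l}{\bf w}_k \cdot {\bf h}_k \in \mathcal{F}$,

$$f = \sum_{k=1}^{l}\sum_{j=1}^{n_k} w_{k,j}\,h_{k,j} = \sum_{k=1}^{l}\sum_{j=1}^{n_k} |w_{k,j}|\,\big(\operatorname{sign}(w_{k,j})\,h_{k,j}\big),$$

where each $\operatorname{sign}(w_{k,j})\,h_{k,j}$ lies in $\widetilde{H}_k \subseteq H$ and the coefficients $|w_{k,j}| \geq 0$ sum to one because $\sum_{k=1}^l \|{\bf w}_k\|_1 = 1$, so $f$ appears to be a convex combination of elements of $H$.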

Even so, I do not understand the following:

  1. Why does $\mathcal{F}$ coincide with the convex hull of $H$, i.e. $\mathcal{F} = \operatorname{conv}(H)$? Is $H_k$ not already its own reflection, since it is the set of all possible linear mappings? So why isn't $\widetilde{H}_k = H_k = -H_k$?
  2. Does this mathematical representation of the network architecture allow for biases to be included in the various layers?