A shortened and simplified explanation of the problem:
We have a sequence of variables from $R$ of length $S$:
$SEQ = [x_1^{(1)} \ x_2^{(2)} \ x_4^{(3)} \ x_2^{(4)} \ x_{12}^{(5)} \ x_6^{(6)} \ x_{15}^{(7)} \ x_4^{(8)} \ x_8^{(9)} ... \ x_i^{(k)} \ ... x_{j}^{(S)}] \in R^S, \ \ i,j \in I, \ \ S \in Z$
$X = (x_1 \ x_2 \ x_3 \ x_4 \ x_5 \ x_6 ... x_I)$
Assume we have subsequences $[x_7 \ x_2 \ x_5 ]$ and $[x_3 \ x_9 \ x_4 ]$ in our sequence, with $x_7$ close to $x_3$ and $x_5$ close to $x_4$; then we want $x_2$ and $x_9$ to be close too. A special case: $[x_7 \ x_2 \ x_5 ]$ and $[x_7 \ x_9 \ x_5 ]$ - in this case we want $x_2$ and $x_9$ to be really close. On the other hand, to avoid collapsing all the variables together, we need one more rule: if we had $[x_7 \ x_2 \ x_5 ]$ and $[x_7 \ x_9 \ x_5 ]$, but no $[x_7 \ x_3 \ x_5 ]$ in our sequence, then $x_3$ must be far from $x_2$ and $x_9$, which are close to each other.
For this purpose, we define a family of functions $f_a : R^3 \rightarrow R$. By varying the values of $x_i \in X$ and the parameter $a$ we try to reach (see the example above) $f_a(x_7, x_2, x_5 ) \approx f_a(x_7, x_9, x_5) \gt f_a(x_7, x_3, x_5) $ - then (due to the continuity of $f_a$ in its second argument) the variation of $x_i$ must bring $x_2$ and $x_9$ close enough, and push $x_3$ (a bad, foreign element for that "context") far from them. (Note: having $x_2$ and $x_9$ close is not the only solution of $f_a(x_7, x_2, x_5 ) \approx f_a(x_7, x_9, x_5)$, but we assume that "on average" over all variables this must be true.)
To solve that problem for all variables simultaneously, we define $ L_i(x_{bad}) = \max(0,\ 1 - f_a(x^{(i-1)}, x^{(i)}, x^{(i+1)}) + f_a(x^{(i-1)}, x_{bad}, x^{(i+1)}))$. If our assumption is true ($f_a$ is higher for subsequences from our SEQ than for those that are not in there - with $x_{bad}$ in the middle), then $L_i$ must be $0$ almost all the time. So we want $L_i(x_j)$ to be $0$ as many times as possible.
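A tiny Python sketch of this hinge term, just to pin it down; the scorer here is a made-up stand-in for $f_a$, purely for illustration:

```python
def hinge_term(f_a, x_prev, x_good, x_bad, x_next, margin=1.0):
    """L_i(x_bad): zero when the observed triple out-scores the
    corrupted one (x_bad in the middle) by at least `margin`."""
    return max(0.0, margin
                    - f_a(x_prev, x_good, x_next)
                    + f_a(x_prev, x_bad, x_next))

# Toy scorer standing in for f_a (an assumption, not from the problem).
def toy_score(a, b, c):
    return a + 2.0 * b - c

print(hinge_term(toy_score, 1.0, 3.0, 0.0, 1.0))  # 0.0: true triple wins by a wide margin
```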
So we optimize $\sum\limits_{0 \le i \le S}\sum\limits_{j\in I} L_i(x_j) \rightarrow \min_{a, X}$, and that must work fine, but if $S$ and $|I|$ are big enough, we have an extremely computationally expensive problem. $S = 10^6$ and $|I| = 10^4$ are common numbers for this task.
I just had an idea of optimizing such a thing: $\sum\limits_{0 \le i \le S}\sum\limits_{0 \le j \le S} [ \ (x^{(i-1)} - x^{(j-1)})^2 + (x^{(i+1)} - x^{(j+1)})^2 - (x^{(i)} - x^{(j)})^2 \ ]^2 + \frac{\lambda}{\sum\limits_{i, j} (x^{(i)} - x^{(j)})^2} \rightarrow \min_X$
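A naive $O(S^2)$ evaluation of this objective in plain Python, just to make the formula concrete (all variable names are mine):

```python
def triple_objective(x, lam=1.0):
    """Triples with close edges should have close middles; the lam term
    penalizes the degenerate solution where everything collapses."""
    S = len(x)
    match_term = 0.0
    for i in range(1, S - 1):
        for j in range(1, S - 1):
            edges = (x[i-1] - x[j-1])**2 + (x[i+1] - x[j+1])**2
            middle = (x[i] - x[j])**2
            match_term += (edges - middle)**2
    spread = sum((x[i] - x[j])**2 for i in range(S) for j in range(S))
    if spread == 0:
        return float('inf')  # all variables identical: maximal penalty
    return match_term + lam / spread
```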
That says exactly: "I want triples with close edges to have close insides, but not too close overall (if all of them get too close, the penalty rises)". That might work, right?
And yes, it works, somehow. Sentence: "I like cats very much I like dogs very much I like mouse very much I love mouse very much". Clustered words (one-dimensional embeddings):

('much', -0.10600726003489237)
('very', -0.096242824773913105)
('cats', -0.030582320070865161)
('dogs', -0.030581881036943714)
('mouse', 0.054408944326038489)
('love', 0.057103209358647655)
('like', 0.06836113503129479)
('I', 0.083531905578858714)
But even for such a simple sentence it takes very long to compute.
Full explanation:
I faced this optimization problem while trying to implement a neural word-embedding language model (a nonlinearity compresses the space of words based on their contexts, but that doesn't really matter here).
I have:
- N = 15 000 - number of words in model
- dim = 100 - number of dimensions
- D - space of uncompressed (one-hot) words (basis vectors, like $(0 0 ... 0 1 0 ... 0 0)^T \in \{0, 1\}^{15000}$ ), |D| = 15 000
- C - space of compressed words like $(0.45 \ 0.34 \ 0.67 \ 0.23 \ \ ...)^T \in R^{dim}$
We try to build a function that simply stores a compressed vector from C for every uncompressed vector from D - a matrix LT of size $dim \times N$. The i-th column of LT is the compressed representation of the i-th word $e_i = (I_{i=k} \ \vert \ 1 \le k \le N)^T \in \{0, 1\}^{15000}$ and can be computed as $LT \times e_i$.
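Since $e_i$ is one-hot, $LT \times e_i$ is just the i-th column of LT, so in code a plain column lookup replaces the matrix-vector product. A quick check with the shapes from the post (the random LT is a placeholder):

```python
import numpy as np

dim, N = 100, 15000
rng = np.random.default_rng(0)
LT = rng.standard_normal((dim, N))  # the lookup table, i.e. the model

i = 42
e_i = np.zeros(N)
e_i[i] = 1.0  # one-hot word vector

# Full multiplication and direct column lookup give identical results.
assert np.allclose(LT @ e_i, LT[:, i])
```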
So, I have:
- matrix LT $\in M^{dim \times N}$ - that is 1 500 000 scalars (15 000 columns of 100 scalars) - the model itself, $ LT: D \rightarrow C$ (acting by multiplication)
- score function $ score : C^3 \rightarrow R$, $score(c_1, c_2, c_3) = U \times \tanh(W \times concat(c_1, c_2, c_3) + b) $ - a neural network, in fact (concat() turns three dim-dimensional vectors into a single 3dim-dimensional vector)
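In numpy this score function is a one-hidden-layer network over the concatenated triple; the hidden size h and the random parameters below are placeholders of my choosing, not values from the model:

```python
import numpy as np

dim, h = 100, 50  # h: hidden size, an arbitrary choice here
rng = np.random.default_rng(1)
W = rng.standard_normal((h, 3 * dim))  # weights over the concatenated triple
b = rng.standard_normal(h)             # hidden bias
U = rng.standard_normal(h)             # output projection

def score(c1, c2, c3):
    """U . tanh(W @ concat(c1, c2, c3) + b) -> a single scalar."""
    return U @ np.tanh(W @ np.concatenate([c1, c2, c3]) + b)

s = score(*(rng.standard_normal(dim) for _ in range(3)))
```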
- a long list of words: $(e_1, ..., e_S) \mapsto S = (c_1, c_2, ..., c_S) $ - their compressed versions.
I have a Loss function I want to optimize:
$\sum\limits_{(c_{i-1}, c_i, c_{i+1}) \in S} \sum\limits_{c_{wrong}} \max(0,\ 1-score(c_{i-1}, c_i, c_{i+1})+score(c_{i-1}, c_{wrong}, c_{i+1}) ) \rightarrow \min_{LT, U, W, b}$, where $c_{wrong}$ ranges over the compressed vectors of all words (the columns of LT)
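Spelled out naively in Python, the loss makes the cost visible: every interior position pays one score evaluation per vocabulary word, i.e. O(S·N) network passes per full evaluation (the scorer passed in is arbitrary; names are mine):

```python
import numpy as np

def ranking_loss(score, seq, LT, margin=1.0):
    """seq: word indices of the corpus; LT: dim x N lookup table.
    For each interior position, every vocabulary word plays c_wrong."""
    _, N = LT.shape
    total = 0.0
    for i in range(1, len(seq) - 1):
        c_prev, c_mid, c_next = LT[:, seq[i-1]], LT[:, seq[i]], LT[:, seq[i+1]]
        s_good = score(c_prev, c_mid, c_next)
        for w in range(N):  # the expensive inner sum over the whole vocabulary
            s_bad = score(c_prev, LT[:, w], c_next)
            total += max(0.0, margin - s_good + s_bad)
    return total
```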
, where each $c_i = LT \times e_i$ - I want co-occurring word sequences to have higher scores.
Optimization over U, W, b isn't a problem - it's just a regular NN - but it seems to me that optimizing the whole LT at the same time (1 500 000 scalars) is quite computationally complex :) So, could anybody give me advice on how to break this complex task into smaller ones, or something like that?
The authors of the main idea say they used simple stochastic gradient descent to optimize it over LT, U, W, b, but it took them more than a few weeks to get any kind of result. If there's no way to simplify that, I'm thinking of L-BFGS or something like it.
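For reference, the stochastic-gradient route usually replaces the full inner sum over the vocabulary with a single randomly sampled "wrong" word per position, which shrinks each step from O(N) to O(1) score calls. A sketch of the sampled loss only (the gradient step itself is omitted, and this is a common recipe, not necessarily the original authors' exact one):

```python
import numpy as np

def sampled_loss(score, seq, LT, rng, margin=1.0):
    """One random negative per position: an estimate (up to the factor N)
    of the full inner sum over the vocabulary."""
    _, N = LT.shape
    total = 0.0
    for i in range(1, len(seq) - 1):
        c_prev, c_mid, c_next = LT[:, seq[i-1]], LT[:, seq[i]], LT[:, seq[i+1]]
        w = int(rng.integers(N))  # sampled c_wrong
        total += max(0.0, margin
                          - score(c_prev, c_mid, c_next)
                          + score(c_prev, LT[:, w], c_next))
    return total
```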