Reducing the dimensionality of the unknowns vector for a particular case of gradient descent.


I'm currently working on a parameter-optimization problem and have run into the following issue.

Introduction:

Let's say I have a dataset of typos and their corrections, which looks like:

| original | correct |
| --- | --- |
| sentence 1 | correction of sentence 1 |
| ... | ... |

Due to the diversity of the model's application fields and its speed requirements, the following statistical model was suggested as the solution: $$ C = \gamma\log{(P_1)} + \theta\log{(P_2)} + (1 - \theta)\log{(P_3)} + \text{feed_boost} + \text{query_boost} $$

This formula applies to each word in a sentence (and to its candidates, obtained via Levenshtein distance), where

  • $P_1$ - the probability that a word has a typo and will be replaced by a candidate.
  • $P_2$ - the probability that a word occurs within its context (left/right-side words: | context-l | word | context-r |) in the specific observation field.
  • $P_3$ - the probability that a word occurs within its context in the language at large (I use Wikipedia texts as the language dataset).

Eventually, for the whole sentence the cost function looks like $\sum_{i=1}^{\# \text{words}}{C(\text{word}_i)}$.
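
As a sanity check, the cost of a single word and of a whole sentence can be computed directly from the formula; this is a minimal sketch (the function names and the representation of a word as a $(P_1, P_2, P_3)$ triple are my assumptions):

```python
import math

def word_cost(p1, p2, p3, gamma, theta, feed_boost, query_boost):
    """Cost C for one word-candidate pair, taken verbatim from the formula above."""
    return (gamma * math.log(p1)
            + theta * math.log(p2)
            + (1 - theta) * math.log(p3)
            + feed_boost + query_boost)

def sentence_cost(words, gamma, theta, feed_boost, query_boost):
    """Sum of word costs over a sentence; `words` is a list of (P1, P2, P3) triples."""
    return sum(word_cost(p1, p2, p3, gamma, theta, feed_boost, query_boost)
               for p1, p2, p3 in words)
```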

The issue itself:

Occasionally the hyperparameters $(\gamma,\ \theta,\ \text{feed_boost},\ \text{query_boost})$ need to be adjusted. It is too time-consuming to test every combination of parameters over some range, so I decided to implement gradient descent for this particular problem. Taking the derivatives, I obtained:

  • $\frac{\partial C}{\partial\gamma} = \log{(P_1)}$
  • $\frac{\partial C}{\partial\theta} = \log{(P_2)} - \log{(P_3)}$
  • $\frac{\partial C}{\partial\ \text{feed_boost}} = 1$
  • $\frac{\partial C}{\partial\ \text{query_boost}} = 1$
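
Since $C$ is linear in $\gamma$ and $\theta$, these derivatives are easy to verify numerically with central finite differences; a small sketch (all constants are illustrative):

```python
import math

def C(p1, p2, p3, gamma, theta, feed_boost, query_boost):
    # cost of one word, from the formula above
    return (gamma * math.log(p1) + theta * math.log(p2)
            + (1 - theta) * math.log(p3) + feed_boost + query_boost)

def numeric_grads(p1, p2, p3, gamma, theta, feed_boost, query_boost, h=1e-6):
    """Central finite differences for dC/dgamma and dC/dtheta."""
    d_gamma = (C(p1, p2, p3, gamma + h, theta, feed_boost, query_boost)
               - C(p1, p2, p3, gamma - h, theta, feed_boost, query_boost)) / (2 * h)
    d_theta = (C(p1, p2, p3, gamma, theta + h, feed_boost, query_boost)
               - C(p1, p2, p3, gamma, theta - h, feed_boost, query_boost)) / (2 * h)
    return d_gamma, d_theta
```

Because $C$ is linear in both parameters, the finite differences should match $\log{(P_1)}$ and $\log{(P_2)} - \log{(P_3)}$ up to rounding error.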

The problem arises when I compute the first two derivatives. Since we already have the equation, there is no need to compute weights and biases. The equation applies to the entire dataset, but its parameters are probably not the best. Consider a single iteration of the parameter optimization:

  1. As input we pass the dataset (shown above) of size $N$.
  2. Apply $\sum{C}$ and keep the best candidate for each sentence ($\max{\left(\left\{\sum{C}\right\}_{i=1}^{N}\right)}$).
  3. Measure some loss metric.
  4. Compute the gradient of the loss.
  5. Adjust the parameters.
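
The first two steps can be sketched as follows (the dataset layout, one list of candidate $(P_1, P_2, P_3)$ triples per instance, is my assumption):

```python
import math

def C(p, gamma, theta, feed_boost, query_boost):
    """Cost of one candidate triple p = (P1, P2, P3), from the formula above."""
    p1, p2, p3 = p
    return (gamma * math.log(p1) + theta * math.log(p2)
            + (1 - theta) * math.log(p3) + feed_boost + query_boost)

def best_candidates(dataset, params):
    """Step 2: for each instance, keep the candidate triple with the highest cost.

    `dataset` is a list of candidate lists; `params` is the tuple
    (gamma, theta, feed_boost, query_boost). The i-th components of the
    returned triples form the vectors p1, p2, p3 used below.
    """
    return [max(candidates, key=lambda p: C(p, *params))
            for candidates in dataset]
```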

During the 4th step I'm puzzled about what to do with $P_1$, $P_2$, $P_3$. We get the best of these probabilities for every instance in the dataset during the 2nd step. Therefore, after the 2nd step I obtain the following vectors: $$ p_1 = \begin{bmatrix} P_{1, 1} \\ P_{1, 2} \\ \vdots \\ P_{1, N} \end{bmatrix} \quad p_2 = \begin{bmatrix} P_{2, 1} \\ P_{2, 2} \\ \vdots \\ P_{2, N} \end{bmatrix} \quad p_3 = \begin{bmatrix} P_{3, 1} \\ P_{3, 2} \\ \vdots \\ P_{3, N} \end{bmatrix} $$

However, the derivatives expect a single number, not a vector. Corollary: I have to come up with a function $f: \mathbb{R}^N \rightarrow \mathbb{R}$ that most accurately represents the distribution of the vectors' values. I have come up with two ideas:

Normalized Expectation

I considered two kinds of expectation:

  1. Weighted: $$ f(p) = \frac{\mathbb{E}[p]}{\sigma} = \frac{\sum_{i=1}^{N}{\left(\frac{p_i^2}{\sum_{j=1}^{N}{p_j}}\right)}}{\sigma}, $$ where $\sigma$ is the standard deviation of $p$.

  2. Unweighted: $$ f(p) = \frac{\mathbb{E}[p]}{\sigma} = \frac{\frac{1}{N}\sum_{i=1}^{N}{p_i}}{\sigma} $$
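
Both reductions are straightforward to implement; a sketch (I use the population standard deviation here — whether $\sigma$ should be the sample or population deviation is a choice the formulas above leave open, so treat it as my assumption):

```python
import statistics

def f_unweighted(p):
    """E[p] / sigma with the plain mean (unweighted variant)."""
    sigma = statistics.pstdev(p)
    return (sum(p) / len(p)) / sigma

def f_weighted(p):
    """E[p] / sigma with the self-weighted mean sum(p_i^2) / sum(p_j) (weighted variant)."""
    sigma = statistics.pstdev(p)
    return (sum(x * x for x in p) / sum(p)) / sigma
```

Note that both variants are undefined when all $p_i$ are equal ($\sigma = 0$), so a guard may be needed in practice.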

Compute the derivatives for every instance and average them up:

In my mind it looks like a specific case of Jacobian matrix: $$ J\left(\left[\frac{\partial C(p_1, p_2, p_3)}{\partial \gamma}, \frac{\partial C(p_1, p_2, p_3)}{\partial \theta}, \frac{\partial C(p_1, p_2, p_3)}{\partial\ \text{feed_boost}}, \frac{\partial C(p_1, p_2, p_3)}{\partial\ \text{query_boost}}\right]\right) = J\left(\begin{bmatrix}\nabla^TC(P_{1,1}, P_{2,1}, P_{3,1})\\ \nabla^TC(P_{1,2}, P_{2,2}, P_{3,2})\\ \vdots \\ \nabla^TC(P_{1,N}, P_{2,N}, P_{3,N}) \end{bmatrix}\right) = \begin{bmatrix}\frac{\partial C(P_{1,1}, P_{2,1}, P_{3,1})}{\partial \gamma} & \frac{\partial C(P_{1,1}, P_{2,1}, P_{3,1})}{\partial \theta} & \frac{\partial C(P_{1,1}, P_{2,1}, P_{3,1})}{\partial\ \text{feed_boost}} & \frac{\partial C(P_{1,1}, P_{2,1}, P_{3,1})}{\partial\ \text{query_boost}}\\ \frac{\partial C(P_{1,2}, P_{2,2}, P_{3,2})}{\partial \gamma} & \frac{\partial C(P_{1,2}, P_{2,2}, P_{3,2})}{\partial \theta} & \frac{\partial C(P_{1,2}, P_{2,2}, P_{3,2})}{\partial\ \text{feed_boost}} & \frac{\partial C(P_{1,2}, P_{2,2}, P_{3,2})}{\partial\ \text{query_boost}}\\ \vdots & \vdots & \vdots & \vdots \\ \frac{\partial C(P_{1,N}, P_{2,N}, P_{3,N})}{\partial \gamma} & \frac{\partial C(P_{1,N}, P_{2,N}, P_{3,N})}{\partial \theta} & \frac{\partial C(P_{1,N}, P_{2,N}, P_{3,N})}{\partial\ \text{feed_boost}} & \frac{\partial C(P_{1,N}, P_{2,N}, P_{3,N})}{\partial\ \text{query_boost}} \end{bmatrix} $$

And by applying an averaging function $f$ to each column, we obtain a single value per parameter:

  1. Weighted
  • $\frac{\partial C}{\partial \gamma} = f\left( \left\{J_{1,i}\right\}_{i = 1}^{N}\right) = \mathbb{E}{\left[ \left\{J_{1,i}\right\}_{i = 1}^{N}\right]} = \sum_{i=1}^{N}{\left(\frac{J_{1,i}^{2}}{\sum_{j=1}^{N}{J_{1,j}}}\right)}$
  • $\frac{\partial C}{\partial \theta} = f\left( \left\{J_{2,i}\right\}_{i = 1}^{N}\right) = \mathbb{E}{\left[ \left\{J_{2,i}\right\}_{i = 1}^{N}\right]} = \sum_{i=1}^{N}{\left(\frac{J_{2,i}^{2}}{\sum_{j=1}^{N}{J_{2,j}}}\right)}$
  • $\frac{\partial C}{\partial\ \text{feed_boost}} = 1$
  • $\frac{\partial C}{\partial\ \text{query_boost}} = 1$
  2. Unweighted
  • $\frac{\partial C}{\partial \gamma} = f\left( \left\{J_{1,i}\right\}_{i = 1}^{N}\right) = \mathbb{E}{\left[ \left\{J_{1,i}\right\}_{i = 1}^{N}\right]} = \frac{1}{N}\sum_{i=1}^{N}{J_{1,i}}$
  • $\frac{\partial C}{\partial \theta} = f\left( \left\{J_{2,i}\right\}_{i = 1}^{N}\right) = \mathbb{E}{\left[ \left\{J_{2,i}\right\}_{i = 1}^{N}\right]} = \frac{1}{N}\sum_{i=1}^{N}{J_{2,i}}$
  • $\frac{\partial C}{\partial\ \text{feed_boost}} = 1$
  • $\frac{\partial C}{\partial\ \text{query_boost}} = 1$
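
The unweighted version of this reduction is a plain column mean of the Jacobian; a minimal sketch:

```python
import math

def grad_row(p1, p2, p3):
    """One Jacobian row: [dC/dgamma, dC/dtheta, dC/dfeed_boost, dC/dquery_boost]."""
    return [math.log(p1), math.log(p2) - math.log(p3), 1.0, 1.0]

def averaged_gradient(triples):
    """Unweighted column means of the N x 4 Jacobian built from (P1, P2, P3) triples."""
    rows = [grad_row(*t) for t in triples]
    n = len(rows)
    return [sum(row[k] for row in rows) / n for k in range(4)]
```

One caveat with the weighted variant: it divides by $\sum_{j}{J_{k,j}}$, and for the $\theta$ column $J_{2,i} = \log{(P_{2,i})} - \log{(P_{3,i})}$ can change sign, so that denominator may be arbitrarily close to zero.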

So the question is: does such an $f$ look reasonable, or is it perhaps unnecessary in this particular case? I'll be happy to see any suggestions :)