Minorization-Maximization (MM) Algorithm: Bradley-Terry Ranking


With the help of exercises in my Machine Learning class and the course notes of Prof. Kenneth Lange available there, I'm trying to understand the derivation of the Minorization-Maximization (MM) algorithm through the example of Bradley-Terry ranking:

Consider a sports league with $m$ teams. Assign team $i$ the skill level $\theta_i$. Bradley and Terry proposed the model $$ P(i\text{ beats } j) = \frac{\theta_i}{\theta_i + \theta_j}. $$ To ensure that the skill levels are identifiable, we set $\theta_1 = 1$. If $b_{ij}$ is the number of times $i$ beats $j$, then the likelihood of the data is $$ L(\theta) = \prod_{i,j}\Big(\frac{\theta_i}{\theta_i + \theta_j}\Big)^{b_{ij}}. $$ Estimate $\theta$ by maximizing the log-likelihood $f(\theta) = \log L(\theta)$. Construct the minorizing function $g(\theta|\theta^t)$ by using the supporting hyperplane, then maximize $g(\theta|\theta^t)$.

\textbf{Solution}

The log-likelihood function is $$ f(\theta) = \sum_{i,j}b_{ij}\big(\log(\theta_i) - \log(\theta_i + \theta_j)\big). $$ We can use the supporting-hyperplane property of a convex function $h$, $$ h(y) \ge h(x) + \nabla h(x)^\top(y-x). $$ For the convex function $h(x) = -\log(x)$ this gives \begin{align*} -\log(y) \ge -\log(x) -\frac{1}{x}(y-x) &=-\log(x) - \frac{y-x}{x}\\ &= -\log(x) - \frac{y}{x} + 1, \end{align*} with equality at $y = x$. Since $b_{ij} \ge 0$, multiplying both sides by $b_{ij}$ and summing over all pairs preserves the inequality. We can substitute $x = \theta_i^t + \theta_j^t$ (using the previous estimate) and $y = \theta_i + \theta_j$. We get \begin{align*} \sum_{i,j}b_{ij}\Big(-\log(\theta_i + \theta_j)\Big) &\ge\sum_{i,j}b_{ij}\Big( -\log(\theta_i^t + \theta_j^t)- \frac{\theta_i + \theta_j}{\theta_i^t + \theta_j^t} + 1\Big)\\ \underbrace{ \sum_{i,j}b_{ij}\Big(\log(\theta_i)-\log(\theta_i + \theta_j)\Big)}_{f(\theta)} &\ge\underbrace{\sum_{i,j}b_{ij}\Big(\log(\theta_i) -\log(\theta_i^t + \theta_j^t)- \frac{\theta_i + \theta_j}{\theta_i^t + \theta_j^t} + 1\Big)}_{g(\theta|\theta^t)},\tag{1} \end{align*} with equality at $\theta = \theta^t$, so $g$ minorizes $f$.

Notice that the surrogate objective is now separable w.r.t. the parameters $\theta_i$: collecting the terms of $g$ that involve $\theta_i$, we need to maximize $$ \sum_{j \ne i}b_{ij}\log(\theta_i) - \theta_i\sum_{j \ne i}\frac{b_{ij} + b_{ji}}{\theta_i^t + \theta_j^t}\tag{2} $$ w.r.t. $\theta_i$ for every $i$. We can put this in the form $a \log (\theta_i) - c\,\theta_i$, where $a, c > 0$. Because this function is concave, setting the derivative to zero (at $\theta_i = a/c$) yields the global maximum. The next estimate is therefore $$ \theta_i^{t + 1} = \frac{\sum_{j\ne i}b_{ij}}{\sum_{j\ne i}(b_{ij} + b_{ji})/(\theta_i^t + \theta_j^t)}.\tag{3} $$

My questions:

  • Is the derivation up to equation (1) correct?
  • What are the exact steps to go from (1) to (2)?
  • What are the exact steps to go from (2) to (3)?
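As a sanity check on the derivation, here is a minimal Python sketch that evaluates $f$, the surrogate $g(\theta|\theta^t)$ from (1), and the update (3), then numerically verifies the minorization and ascent properties. The win-count matrix `b` is a made-up example, not data from the source:

```python
import math

# Hypothetical 4-team win-count matrix: b[i][j] = number of times team i beat team j.
b = [
    [0, 3, 2, 4],
    [1, 0, 5, 2],
    [2, 1, 0, 3],
    [0, 2, 1, 0],
]
m = len(b)

def loglik(theta):
    # f(theta) = sum_{i != j} b_ij * (log theta_i - log(theta_i + theta_j))
    return sum(b[i][j] * (math.log(theta[i]) - math.log(theta[i] + theta[j]))
               for i in range(m) for j in range(m) if i != j)

def minorizer(theta, theta_t):
    # g(theta | theta_t) from equation (1)
    return sum(b[i][j] * (math.log(theta[i])
                          - math.log(theta_t[i] + theta_t[j])
                          - (theta[i] + theta[j]) / (theta_t[i] + theta_t[j])
                          + 1.0)
               for i in range(m) for j in range(m) if i != j)

def mm_update(theta_t):
    # Equation (3); afterwards rescale so theta_1 = 1, which is harmless
    # because the likelihood only depends on ratios of the theta_i.
    new = []
    for i in range(m):
        num = sum(b[i][j] for j in range(m) if j != i)
        den = sum((b[i][j] + b[j][i]) / (theta_t[i] + theta_t[j])
                  for j in range(m) if j != i)
        new.append(num / den)
    return [t / new[0] for t in new]

theta = [1.0] * m
for _ in range(200):
    theta = mm_update(theta)

# Minorization: g touches f at theta_t and lies below f elsewhere.
assert abs(minorizer(theta, theta) - loglik(theta)) < 1e-10
other = [0.7, 1.3, 0.9, 1.8]
assert minorizer(other, theta) <= loglik(other) + 1e-12
# Ascent property: one MM step never decreases the log-likelihood.
assert loglik(mm_update(other)) >= loglik(other) - 1e-10
```

The assertions encode exactly what the MM argument promises: $g(\theta^t|\theta^t) = f(\theta^t)$, $g(\theta|\theta^t) \le f(\theta)$ for any other $\theta$, and hence $f(\theta^{t+1}) \ge f(\theta^t)$.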