Difference between $\frac{1}{n} \sum_{i=0}^n \frac{b_i}{a_i+b_i}$ and $\frac{\sum_{i=0}^n b_i}{\sum_{i=0}^n (a_i + b_i)}$


Consider a scenario like this:

  • each user can add movies,
  • each user can rate the movies they added,
  • I want to know what percentage of their movies an average user has left unrated.

Let's translate this into math. Let: $$a_i = |M_i^+| \\ b_i=|M_i^0|$$ where $M_i$ denotes the movies watched by user $i$, $M^+$ the rated movies and $M^0$ the unrated ones. Obviously, $a_i, b_i \in \mathbb{N}$. Now, I could compute the following:

$$\frac{1}{n}\sum_{i=0}^n \frac{b_i}{a_i+b_i}$$

but the point is that it requires a large amount of computing resources. It is much easier to compute how many movies were left unrated relative to the total movie count, with no regard to specific users:

$$\frac{\sum_{i=0}^n b_i}{\sum_{i=0}^n (a_i+b_i)} = \frac{|M^0|}{|M|}$$
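For concreteness, here is a minimal Python sketch of both quantities on made-up per-user counts (the data is purely illustrative):

```python
# Hypothetical per-user counts (a_i, b_i) = (rated, unrated).
users = [(3, 1), (5, 5), (2, 6)]

# Per-user average of ratios: (1/n) * sum of b_i / (a_i + b_i).
avg_of_ratios = sum(b / (a + b) for a, b in users) / len(users)

# Global ratio: sum of b_i over sum of (a_i + b_i) --
# this is the one that reduces to two aggregate database queries.
ratio_of_sums = sum(b for _, b in users) / sum(a + b for a, b in users)

print(avg_of_ratios)  # 0.5
print(ratio_of_sums)  # 12/22 ≈ 0.545
```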

because these values can be retrieved very efficiently from the database, and this is what I do right now. Note that the global ratio of unrated movies is the same as the ratio of the average user's unrated movie count to the average user's watched movie count, $\left(\frac{\frac{1}{n}\sum_{i=0}^n b_i}{\frac{1}{n}\sum_{i=0}^n (a_i+b_i)}\right)$, since the factors of $\frac{1}{n}$ cancel.

The point is that these two quantities (average percentage of unrated movies vs. global percentage of unrated movies) are not the same in general. For example, with $(a_1, b_1) = (1, 1)$ and $(a_2, b_2) = (1, 2)$, $\frac{1}{2}\left(\frac{1}{2}+\frac{2}{3}\right) = \frac{7}{12} \neq \frac{3}{5} = \frac{1+2}{2+3}$. However, I also discovered through simulations that for random values $a_i$ and $b_i$ with $a_i, b_i \in (0,500)$, the two quantities do converge in some way (i.e., the difference between them gets smaller and smaller as $n$ grows).
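A short sketch of the kind of simulation described above (drawing $a_i, b_i$ uniformly from $(0, 500)$ is my assumption about what "random values" means here):

```python
import random

random.seed(0)  # fixed seed so runs are reproducible

def gap(n):
    """Absolute difference between the two quantities for n random users."""
    pairs = [(random.uniform(0, 500), random.uniform(0, 500))
             for _ in range(n)]
    avg_of_ratios = sum(b / (a + b) for a, b in pairs) / n
    ratio_of_sums = sum(b for _, b in pairs) / sum(a + b for a, b in pairs)
    return abs(avg_of_ratios - ratio_of_sums)

for n in (10, 1_000, 100_000):
    # The gap typically shrinks as n grows.
    print(n, gap(n))
```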

So my question is: what is the real difference between these two quantities, and how do I bound the error? How do I begin to analyze this formally, rather than guessing through random simulations? (Is this even possible, given only the information about $a_i$ and $b_i$?) I'm totally lost.