Question on probability and approximation

Okay, I think you are all familiar with YouTube videos. Some facts:

  1. To comment on, like, or dislike a video, you need a Google account.

  2. When someone views the video, its view count increases by one regardless of whether the viewer has a Google account.

Having said that, I'm wondering how useful the data from the like/dislike bar really is. For example, suppose a video has 1 million views and a like/dislike count of 4000 to 250. Can one use that data to:

  1. say what the trend among these 1 million people would have been if all of them had had a Google account? Or would we need a third piece of data, such as the number of abstentions (i.e. not everyone who had the chance to interact necessarily did), for the like/dislike bar to be useful?

  2. get an idea of what those who commented felt (i.e. whether their comments were positive or negative) without having to go through each comment?

There is 1 solution below

I would say either of those inferences would be extremely dicey due to selection biases. That is, the set of voters is unlikely to be particularly representative of the set of viewers.

My thinking is that you could argue that the set of voters is a representative sample of the set of viewers if the voters were randomly drawn from the set of viewers, but that isn't the case. Instead, the voters chose to express their opinion in a vote whereas the non-voters chose not to. If that choice is correlated with their opinion on the video (and it seems likely it would be), then the set of voters is not representative of the set of viewers.
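To make the effect concrete, here is a minimal sketch of how far the voters' like share can drift from the viewers' like share when the decision to vote depends on opinion. Everything in it is an assumption for illustration: the function name `observed_like_share`, the idea that every viewer has a binary opinion, and the made-up voting rates. None of these numbers come from YouTube.

```python
def observed_like_share(true_like_share, p_vote_if_like, p_vote_if_dislike):
    """Expected share of likes among votes cast.

    true_like_share   -- hypothetical fraction of ALL viewers who liked the video
    p_vote_if_like    -- hypothetical probability that a viewer who liked it votes
    p_vote_if_dislike -- hypothetical probability that a viewer who disliked it votes
    """
    like_votes = true_like_share * p_vote_if_like
    dislike_votes = (1 - true_like_share) * p_vote_if_dislike
    return like_votes / (like_votes + dislike_votes)


# Assumed scenario: 60% of viewers liked the video, but people who liked it
# are ten times as likely to bother voting as people who disliked it.
biased = observed_like_share(0.6, p_vote_if_like=0.01, p_vote_if_dislike=0.001)

# Same true opinion split, but voting is unrelated to opinion.
unbiased = observed_like_share(0.6, p_vote_if_like=0.01, p_vote_if_dislike=0.01)

print(f"like share among voters, opinion-dependent voting: {biased:.4f}")
print(f"like share among voters, opinion-independent voting: {unbiased:.4f}")
```

Under these assumed rates, a video that only 60% of viewers liked would show a bar close to the 4000-to-250 one in the question. Only when the voting rate is the same for both groups does the bar recover the true split.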

I would probably, as a first approximation, model non-voters as viewers who had neither a positive nor a negative opinion of the video. But even that is problematic, since some may have had a strong opinion but no account, and therefore couldn't vote. Others may have had a strong opinion and an account but still not voted, simply because there is little incentive to do so. Therefore, I think any inference you make about the set of viewers using the set of voters as a sample will be deeply problematic.
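For a sense of scale, the arithmetic on the numbers from the question can be sketched as follows. The helper `worst_case_bounds` is a hypothetical name, and the calculation assumes that each view is a distinct viewer and that every viewer either liked or disliked the video. Both assumptions are simplifications (repeat views exist, and many viewers may be indifferent), so treat this as a rough illustration only.

```python
def worst_case_bounds(views, likes, dislikes):
    """Return (participation, like share among voters, lower bound, upper bound).

    The bounds are on the like share among ALL viewers when nothing at all
    is assumed about how the non-voters felt.
    """
    votes = likes + dislikes
    participation = votes / views
    like_share_among_voters = likes / votes
    lower = likes / views               # every non-voter disliked the video
    upper = (views - dislikes) / views  # every non-voter liked the video
    return participation, like_share_among_voters, lower, upper


# Numbers from the question: 1 million views, 4000 likes, 250 dislikes.
participation, voter_share, lower, upper = worst_case_bounds(1_000_000, 4_000, 250)

print(f"participation: {participation:.3%}")
print(f"like share among voters: {voter_share:.1%}")
print(f"like share among all viewers: between {lower:.2%} and {upper:.3%}")
```

Fewer than half a percent of the views produced a vote, so without some assumption about the non-voters the data by itself is compatible with almost any overall opinion.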

In terms of how far off you would be, let me answer your question with a question. What exactly are you trying to estimate, and how important is accuracy? If you're trying to make a rough argument for an undergraduate stats class that viewers like cat videos more than dog videos, I think it's basically fine to do what you propose. If you're trying to argue a more important point in a setting where rigor matters more (e.g. something you would submit to an academic journal), I don't think reviewers would let you get away with ignoring the selection bias.