Use of language on Wikipedia - what kind of distribution?

I have an interesting problem and was wondering whether anyone would be able to point me in the right direction.

I am wondering whether the use of a word in the English language on Wikipedia is normally distributed, and/or whether a normal distribution could be used as an approximation to calculate a p-value.

The problem:

  1. Take a search term, let's say 'news'

  2. Crawl 10000 entries on Wikipedia and record the number of occurrences of the term

  3. We know how often a word occurs in the English language as a whole; for example, the word 'attic' occurs 2722 times (based on http://www.wordfrequency.info/intro.asp)

Could we use this information to develop a hypothesis test to ascertain whether the results from our 10000-page crawl are statistically significant?

If not - how could one go about determining statistical significance?

I assume word usage in the English language is not normally distributed. Could a normal distribution still be used as an approximation?

Thank you very much M

There is 1 best solution below

You've got an implicit model here: that Wikipedia pages are generated from some distribution, perhaps by random draws of a random number of words. (Alternatively, you're saying that wiki pages are your measure space, and there's a 0/1 random variable saying whether or not the word "news" appears in the page; in that case, it's not clear what the underlying probability mass function on the set of pages would be.)
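Under the first reading (every word in the crawl is an independent draw that equals the search term with a fixed probability), the count of the term is binomial, and for a large crawl a normal approximation gives a p-value. The following is only a minimal sketch of that calculation; the function name `normal_approx_p_value` and all the numbers are made up for illustration, and the independence assumption is exactly the part that real pages may violate.

```python
import math


def normal_approx_p_value(observed_count, total_words, expected_rate):
    """Two-sided p-value for a one-proportion z-test.

    Assumes each of the total_words tokens is an independent draw that
    equals the target word with probability expected_rate.
    """
    expected = total_words * expected_rate
    std_dev = math.sqrt(total_words * expected_rate * (1.0 - expected_rate))
    z = (observed_count - expected) / std_dev
    # For a standard normal Z, P(|Z| >= |z|) = erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value


# Illustrative numbers only, not measured values.
z, p = normal_approx_p_value(
    observed_count=1200, total_words=5_000_000, expected_rate=0.0002
)
print(f"z = {z:.2f}, p = {p:.3g}")
```

Here `expected_rate` would come from the reference word-frequency list (occurrences divided by corpus size), and `total_words` from the crawl itself.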

You can test whether, with this assumption of how pages are generated, some page or other is anomalous, or some word is used more often than it should be, but any results you find may well end up being an indictment of your model rather than a point of serious statistical significance.
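One rough way to check whether the model itself is at fault is to look at how the term's count varies from page to page. If words were independent draws and pages were of similar length, per-page counts would be roughly Poisson, with variance close to the mean; a ratio far above 1 hints that occurrences cluster on a few pages. The sketch below assumes a hypothetical helper `dispersion_index` and invented counts.

```python
from statistics import mean, pvariance


def dispersion_index(counts_per_page):
    """Variance-to-mean ratio of per-page counts of one word.

    A value near 1 is consistent with Poisson-like independent draws
    (for pages of similar length); a value well above 1 suggests the
    word clusters on particular pages.
    """
    m = mean(counts_per_page)
    if m == 0:
        raise ValueError("the word never occurs in the sample")
    return pvariance(counts_per_page) / m


# Invented counts: the word appears on only two of ten pages.
clustered = [0, 0, 0, 12, 0, 0, 9, 0, 0, 0]
print(dispersion_index(clustered))
```

A large ratio does not say which pages are anomalous; it only suggests that a p-value computed under the independent-draws model should not be taken at face value.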

For instance, the word "Wikipedia" appears disproportionately often on Wikipedia pages. So does the word "external". Neither of these is the slightest bit surprising, despite their large deviation from the usual word frequencies for English.