Maths concept for salient point in graph data


I have collected data concerning the total post counts of users in an online forum (see graphic).

[graph: distribution of users' total post counts]

What I am hoping to do is compare the language of 'first posts' with the language of 'later posts'. My issue is how best to define 'later posts' without relying on a totally arbitrary judgement. A further constraint is that I cannot analyse the content of posts until after I have segmented the data: segmentation must precede analysis.

Based on the graph, it seems intuitive that the 'sweet spot' lies somewhere between five and ten posts. (At 10 posts the forum gives users a new 'rank' and they are no longer a 'newbie', though that threshold is itself arbitrary.)

I've had no maths training (apologies for any difficulty with the post title and tagging), but I'm hoping there is some kind of concept that could justify the choice of a cut-off point for 'later posts'. Two further issues: first, if I set the cut-off too high, a few prolific users could be heavily overrepresented in the data; second, it would be handy for the 'first posts' and 'later posts' selections to be at least roughly similar in word count.

Any ideas greatly appreciated.

1 Answer

Best answer:

Your first sentence is beyond me, as I am not a linguist. However, as a statistician, I can see a possible hypothesis test lurking in your problem formulation. Instead of getting hung up on what constitutes a "later post", I would recommend selecting your breakpoint not on linguistic grounds, but simply so as to create two samples with approximately equal numbers of posts. The sample containing the earlier posts can intuitively be called the early sample, and the remaining posts form the late sample.
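That balancing step can be sketched in a few lines. This assumes only a list of per-user total post counts; the names `split_sizes` and `best_breakpoint` are my own, not part of the answer:

```python
def split_sizes(post_counts, k):
    """Number of posts in the early (posts 1..k per user) vs. late sample."""
    early = sum(min(c, k) for c in post_counts)
    late = sum(max(c - k, 0) for c in post_counts)
    return early, late

def best_breakpoint(post_counts, max_k=20):
    """Breakpoint k whose early/late samples are closest in size."""
    return min(range(1, max_k + 1),
               key=lambda k: abs(split_sizes(post_counts, k)[0]
                                 - split_sizes(post_counts, k)[1]))

counts = [3, 7, 12, 25, 40, 2, 9]   # made-up per-user totals
k = best_breakpoint(counts)         # here k = 9: 48 early posts vs. 50 late
```

One could balance word counts instead of post counts with the same structure, which would also address the questioner's second concern.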

Now, the great thing about hypothesis testing is that you only need to know what your samples should look like under your "null" hypothesis. In your case, I would choose the null hypothesis "There is no socialization to community norms at the level of discourse" vs. the alternative hypothesis "There exists socialization to community norms at the level of discourse." At this point, these two hypotheses are purely qualitative, so you will need to develop a quantitative measure of "distance from the norm" that can be applied to each post. Let's write this distance measure for a particular post as $D(i)$: the distance of post $i$ from the established norm. If $D(i)=0$, the post perfectly conforms to the community norm, while $D(i)>0$ measures how "different" the post is from the norm (this is where your linguistic knowledge will be critical; I have no idea what such a measure would look like).
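The answer deliberately leaves $D(i)$ open. Purely for concreteness, here is one toy possibility: compare a post's word-frequency profile against the community-wide profile. The frequency-based metric and the function names are my own assumptions, not something the answer prescribes:

```python
import math
from collections import Counter

def word_dist(text):
    """Relative word-frequency distribution of a text."""
    words = text.lower().split()
    return {w: c / len(words) for w, c in Counter(words).items()}

def distance_from_norm(post, norm_dist):
    """Toy D(i): Euclidean distance between the post's word frequencies
    and a community-wide frequency profile. 0 means identical profiles."""
    post_dist = word_dist(post)
    vocab = set(norm_dist) | set(post_dist)
    return math.sqrt(sum((post_dist.get(w, 0.0) - norm_dist.get(w, 0.0)) ** 2
                         for w in vocab))
```

A real linguistic measure would likely be far more sophisticated (lexical diversity, jargon uptake, etc.); this only illustrates the required shape: one non-negative number per post.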

If you don't want to have just ONE way to measure differences, you can of course create several such metrics, e.g., $D_1, D_2,$ etc. Each post would then be assigned a vector instead of a single number. In that case, one measure of the difference would be the Euclidean norm of the vector ($\sqrt{\sum_j D_j^2}$).
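That Euclidean combination is a one-liner; a minimal sketch (the name `combined_distance` is hypothetical):

```python
import math

def combined_distance(metrics):
    """Collapse several per-post distance metrics D_1..D_m into one
    number via the Euclidean norm of the vector."""
    return math.sqrt(sum(d * d for d in metrics))
```

For example, `combined_distance([3.0, 4.0])` gives `5.0`. If the metrics live on very different scales, standardizing each one first would be sensible before taking the norm.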

Either way, you will be assigning a single number, $X$, to each post that measures "difference from the norm". We will use these numbers to create quantitative versions of the earlier hypotheses:

Null Hypothesis is $H_0: E[X_{early}]-E[X_{late}] = 0$

Alternative hypothesis $H_a: E[X_{early}]-E[X_{late}] > 0$

Test statistic: $\Delta = \bar X_{early} - \bar X_{late}$

Now we have an actual test that we can perform. I doubt that the test statistic will be normally distributed, so I would suggest running a permutation test at the 0.05 significance level. Basically, the test combines your early and late datasets into one dataset and then calculates your test statistic under every possible way of splitting this combined dataset into two equal-sized datasets. Your p-value is simply the number of such permutations that yield a test statistic greater than or equal to the one calculated from the actual grouping, divided by the total number of permutations.
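The procedure above can be sketched in plain Python. One practical caveat: enumerating every possible split is usually infeasible, so this version samples random relabelings instead (a Monte Carlo permutation test); all names are illustrative:

```python
import random

def permutation_test(early, late, n_perm=10_000, seed=0):
    """One-sided permutation test of H0: E[X_early] = E[X_late]
    against Ha: E[X_early] > E[X_late], using the difference of
    sample means as the test statistic. Samples n_perm random
    relabelings rather than enumerating every split."""
    rng = random.Random(seed)
    observed = sum(early) / len(early) - sum(late) / len(late)
    pooled = list(early) + list(late)
    n_early = len(early)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        delta = (sum(pooled[:n_early]) / n_early
                 - sum(pooled[n_early:]) / (len(pooled) - n_early))
        if delta >= observed:
            hits += 1
    return hits / n_perm   # estimated p-value

# Toy usage with clearly separated samples: p should be small.
p = permutation_test([5, 6, 7, 8], [1, 2, 1, 2])
```

With real data one would plug in the per-post $D(i)$ values for the early and late samples and reject $H_0$ when the returned p-value falls below 0.05.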