My data has the following form:
+----------+------------+------+
| AuthorID | CoAuthorID | Year |
+----------+------------+------+
| 677      | 901706     | 2005 |
| 677      | 901706     | 2005 |
| 677      | 901706     | 2005 |
| 677      | 838459     | 2007 |
| 677      | 901706     | 2007 |
| 677      | 1695352    | 2007 |
| 677      | 901706     | 2009 |
| 677      | 372089     | 2011 |
| 677      | 403400     | 2011 |
| ...      |            |      |
| ...      |            |      |
+----------+------------+------+
I want to calculate the yearly conditional probability of AuthorID given CoAuthorID, using the formula:
P(AuthorID|CoAuthorID) = (P(CoAuthorID|AuthorID) * P(AuthorID)) / P(CoAuthorID)
For instance, with AuthorID 677 and CoAuthorID 901706, the values I get are:
P(AuthorID) = P(677) = 1/390 = 0.00256
P(CoAuthorID) = P(901706) = 1/1 = 1.000
P(CoAuthorID|AuthorID) = P(901706|677) = 5
To explain my calculations:
- For P(AuthorID), I divided by the total number of AuthorIDs in the data, giving 1/390.
- For P(CoAuthorID), I divided by the total number of CoAuthorIDs for the AuthorID (i.e. 677) in year 2005, giving 1/1.
- For P(CoAuthorID|AuthorID), I counted the total number of rows where the AuthorID and CoAuthorID co-occur; there were 5, so I used that.
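To make the counting concrete, here is a minimal sketch in plain Python. It uses only the nine sample rows shown in the table, so the totals 390, 499 and 19411 from the full data are not reproduced; the variable names are just illustrative.

```python
from collections import Counter

# (AuthorID, CoAuthorID, Year) rows copied from the sample table only;
# the full data set is not included here.
rows = [
    (677, 901706, 2005),
    (677, 901706, 2005),
    (677, 901706, 2005),
    (677, 838459, 2007),
    (677, 901706, 2007),
    (677, 1695352, 2007),
    (677, 901706, 2009),
    (677, 372089, 2011),
    (677, 403400, 2011),
]

# Co-occurrences of each (AuthorID, CoAuthorID) pair over all years
pair_counts = Counter((author, coauthor) for author, coauthor, _ in rows)

# Co-occurrences of each (AuthorID, CoAuthorID, Year) triple
pair_year_counts = Counter(rows)

print(pair_counts[(677, 901706)])             # 5
print(pair_year_counts[(677, 901706, 2005)])  # 3
```

These are raw counts, not probabilities, which is part of what I am unsure about.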
My questions are:
- Should I count AuthorIDs only for year 2005 (i.e. 390), or use the whole AuthorID count (i.e. 499)?
- I counted CoAuthorIDs only for the AuthorID (i.e. 677) when calculating P(CoAuthorID), as 1/13. Is that right, or should I use the total number of CoAuthorIDs, i.e. 19411?
- I put 5 for P(CoAuthorID|AuthorID), considering only the number of co-occurrences of AuthorID and CoAuthorID in the whole data. Is that right, or do I have to divide it by some other value?
I know the definition of conditional probability, but the examples I found by googling all involve cards, decks, dice or coins. This is a real application, which is why I am having trouble applying the concept.
I have tried this:
P(677)        = 1/390 = 0.0026  (AuthorID / total AuthorIDs in 2005)
P(901706)     = 1/1   = 1.00    (CoAuthorID / total CoAuthorIDs for AuthorID in 2005)
P(901706|677) = 5     = 5.0     (total CoAuthorID|AuthorID pairs in data)
P(901706|677) = 3     = 3.0     (total CoAuthorID|AuthorID pairs in data in 2005)
P(901706|677) = 3/5   = 0.60    (CoAuthorID|AuthorID pairs in 2005 / total CoAuthorID|AuthorID pairs)
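As a sketch of the two ratios I am choosing between, again on the nine sample rows only: `share_of_pair_in_year` reproduces the 3/5 above, while `coauthor_given_author_in_year` is a plain relative frequency within one year. The second definition is my own assumption about what a yearly P(CoAuthorID|AuthorID) might mean, and both function names are made up for illustration.

```python
# (AuthorID, CoAuthorID, Year) rows copied from the sample table only
rows = [
    (677, 901706, 2005),
    (677, 901706, 2005),
    (677, 901706, 2005),
    (677, 838459, 2007),
    (677, 901706, 2007),
    (677, 1695352, 2007),
    (677, 901706, 2009),
    (677, 372089, 2011),
    (677, 403400, 2011),
]


def share_of_pair_in_year(rows, author, coauthor, year):
    """Fraction of all (author, coauthor) rows that fall in `year`."""
    total = sum(1 for a, c, _ in rows if (a, c) == (author, coauthor))
    in_year = sum(1 for a, c, y in rows if (a, c, y) == (author, coauthor, year))
    return in_year / total if total else 0.0


def coauthor_given_author_in_year(rows, author, coauthor, year):
    """Assumed definition: among the author's rows in `year`,
    the fraction that involve `coauthor`."""
    author_rows = sum(1 for a, _, y in rows if (a, y) == (author, year))
    pair_rows = sum(1 for a, c, y in rows if (a, c, y) == (author, coauthor, year))
    return pair_rows / author_rows if author_rows else 0.0


print(share_of_pair_in_year(rows, 677, 901706, 2005))          # 0.6
print(coauthor_given_author_in_year(rows, 677, 901706, 2005))  # 1.0
```

Unlike the raw counts 5 and 3, both of these stay between 0 and 1, but I do not know which, if either, belongs in the formula.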
Which of these calculations gives the right values to plug into the conditional probability formula?
Please help in this regard.
Thanks!