My data has the following form:
+----------+------------+-------+
| AuthorID | CoAuthorID | Year |
+----------+------------+-------+
| 677 | 901706 | 2005 |
| 677 | 901706 | 2005 |
| 677 | 901706 | 2005 |
| 677 | 838459 | 2007 |
| 677 | 901706 | 2007 |
| 677 | 1695352 | 2007 |
| 677 | 901706 | 2009 |
| 677 | 372089 | 2011 |
| 677 | 403400 | 2011 |
| ... | | |
| ... | | |
+----------+------------+-------+
I want to calculate the yearly conditional probability of AuthorID given CoAuthorID, using the formula for conditional probability:
P(AuthorID|CoAuthorID) = (P(CoAuthorID|AuthorID) * P(AuthorID)) / P(CoAuthorID)
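To make the formula concrete, here is a minimal Python sketch of it. The function name `bayes_posterior` and the numbers in the example call are made up purely for illustration and do not come from my data.

```python
def bayes_posterior(p_c_given_a: float, p_a: float, p_c: float) -> float:
    """Return P(A|C) = P(C|A) * P(A) / P(C)."""
    if p_c <= 0:
        raise ValueError("P(C) must be positive")
    return p_c_given_a * p_a / p_c


# Illustrative inputs only (not taken from the data above)
example = bayes_posterior(p_c_given_a=0.5, p_a=0.2, p_c=0.25)
print(example)
```

All three inputs are expected to be probabilities between 0 and 1, which is exactly where I am unsure about my own values.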
For instance, if AuthorID is 677 and CoAuthorID is 901706, and we have to calculate the conditional probability between them, then the values are:
P(AuthorID) = P(677) = 1/390 = 0.00256
P(CoAuthorID) = P(901706) = 1/1 = 1.000
P(CoAuthorID|AuthorID) = P(901706|677) = 5 = 5
To explain my calculations:
- For P(AuthorID), I have divided 1 by the total number of AuthorIDs in the data, giving 1/390.
- For P(CoAuthorID), I have divided 1 by the total number of CoAuthorIDs for the AuthorID, i.e. 677, in year 2005, giving 1/1.
- For P(CoAuthorID|AuthorID), I have counted the total number of occurrences where AuthorID and CoAuthorID co-exist; there were 5, so I put that in.
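For reference, the counts I am using can be reproduced from the sample rows in the table with a short Python sketch. It only covers the nine rows shown, not the full data, and the variable names are just for illustration.

```python
from collections import Counter

# The nine sample rows shown in the table: (AuthorID, CoAuthorID, Year)
rows = [
    (677, 901706, 2005),
    (677, 901706, 2005),
    (677, 901706, 2005),
    (677, 838459, 2007),
    (677, 901706, 2007),
    (677, 1695352, 2007),
    (677, 901706, 2009),
    (677, 372089, 2011),
    (677, 403400, 2011),
]

# How often each (AuthorID, CoAuthorID) pair occurs over all years
pair_counts = Counter((a, c) for a, c, _ in rows)

# How often each pair occurs within a single year
pair_year_counts = Counter(rows)

# Distinct co-authors of each author in each year
coauthors_by_year = {}
for a, c, y in rows:
    coauthors_by_year.setdefault((a, y), set()).add(c)

print(pair_counts[(677, 901706)])             # 5 co-occurrences overall
print(pair_year_counts[(677, 901706, 2005)])  # 3 co-occurrences in 2005
print(len(coauthors_by_year[(677, 2005)]))    # 1 distinct co-author in 2005
```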
Now I want to ask:
- Should I take the AuthorID count only for year 2005, i.e. 390, or the whole AuthorID count, i.e. 499?
- I have counted CoAuthorID only for AuthorID 677 when calculating P(CoAuthorID), as 1/13. Is this right, or should I consider the total number of CoAuthorIDs, i.e. 19411?
- I have put 5 for P(CoAuthorID|AuthorID), considering only the number of co-occurrences of AuthorID and CoAuthorID in the whole data. Is this right, or do I have to divide it by some other value to calculate it?
I understand the basic idea of conditional probability from googling it, but the examples I found only involve cards, decks, dice, or coins. Here I have a real application, and that is where I am having trouble applying the concept.
I have tried this:
P(677) = 1/390 = 0.0026 (AuthorID / total AuthorIDs in 2005)
P(901706) = 1/1 = 1.00 (CoAuthorID / total CoAuthorIDs for AuthorID in 2005)
P(901706|677) = 5 = 5.0 (total CoAuthorID|AuthorID pairs in data)
P(901706|677) = 3 = 3.0 (total CoAuthorID|AuthorID pairs in data in 2005)
P(901706|677) = 3/5 = 0.60 (CoAuthorID|AuthorID pairs in data in 2005 / total CoAuthorID|AuthorID pairs)
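Another reading I am considering, without being sure it is the right one, is to treat P(CoAuthorID|AuthorID) for a given year as a relative frequency: the pair's rows in that year divided by all of the author's rows in that year. A minimal Python sketch on the nine sample rows only (the helper name `p_coauthor_given_author` is hypothetical):

```python
from fractions import Fraction

# The nine sample rows shown in the table: (AuthorID, CoAuthorID, Year)
rows = [
    (677, 901706, 2005),
    (677, 901706, 2005),
    (677, 901706, 2005),
    (677, 838459, 2007),
    (677, 901706, 2007),
    (677, 1695352, 2007),
    (677, 901706, 2009),
    (677, 372089, 2011),
    (677, 403400, 2011),
]


def p_coauthor_given_author(rows, author, coauthor, year):
    """Share of the author's rows in `year` that involve `coauthor`.

    Returns None when the author has no rows in that year.
    """
    author_rows = [r for r in rows if r[0] == author and r[2] == year]
    if not author_rows:
        return None
    hits = sum(1 for r in author_rows if r[1] == coauthor)
    return Fraction(hits, len(author_rows))


# Yearly values for the pair (677, 901706) on the sample rows
yearly = {
    year: p_coauthor_given_author(rows, 677, 901706, year)
    for year in (2005, 2007, 2009, 2011)
}
for year, p in yearly.items():
    print(year, p)
```

Under this reading every value stays between 0 and 1, unlike the raw counts 5 and 3.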
Which of these calculations gives the right values to put into the conditional probability formula?
Please help in this regard.
Thanks!