How to calculate conditional probability for real data

My data has the following form:

| AuthorID | CoAuthorID | Year |
|----------|------------|------|
| 677      | 901706     | 2005 |
| 677      | 901706     | 2005 |
| 677      | 901706     | 2005 |
| 677      | 838459     | 2007 |
| 677      | 901706     | 2007 |
| 677      | 1695352    | 2007 |
| 677      | 901706     | 2009 |
| 677      | 372089     | 2011 |
| 677      | 403400     | 2011 |
| ...      |            |      |
| ...      |            |      |

I want to calculate the yearly conditional probability of AuthorID given CoAuthorID. The formula for conditional probability (Bayes' rule) is:

P(AuthorID|CoAuthorID) = (P(CoAuthorID|AuthorID) * P(AuthorID)) / P(CoAuthorID)  
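
For reference, a sketch of what this formula becomes if every probability is estimated as a relative frequency over the same set of N rows (for example, all rows of one year). This is an assumption, since the post does not fix the sample space. Here A stands for AuthorID, C for CoAuthorID, and n(·) is a hypothetical notation for "number of rows matching":

```latex
\[
P(A \mid C)
  = \frac{P(C \mid A)\,P(A)}{P(C)}
  = \frac{\dfrac{n(A,C)}{n(A)} \cdot \dfrac{n(A)}{N}}{\dfrac{n(C)}{N}}
  = \frac{n(A,C)}{n(C)}
\]
```

Under that assumption, the three terms on the right-hand side must all be counted over the same rows, and each is a ratio between 0 and 1.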

For instance, if AuthorID is 677 and CoAuthorID is 901706, and we want the conditional probability between them, then the values are:
P(AuthorID) = P(677) = 1/390 = 0.00256
P(CoAuthorID) = P(901706) = 1/1 = 1.000
P(CoAuthorID|AuthorID) = P(901706|677) = 5

To explain my calculations:

  1. For P(AuthorID), I divided 1 by the total number of AuthorIDs in the data, giving 1/390.
  2. For P(CoAuthorID), I divided 1 by the total number of CoAuthorIDs for AuthorID 677 in year 2005, giving 1/1.
  3. For P(CoAuthorID|AuthorID), I counted the number of rows where AuthorID and CoAuthorID occur together; there were 5, so I used 5.

Now I want to ask:

  1. Should I count AuthorIDs only for year 2005 (390), or all AuthorIDs in the data (499)?
  2. When calculating P(CoAuthorID), I counted CoAuthorIDs only for AuthorID 677, giving 1/13. Is that right, or should I use the total number of CoAuthorIDs, i.e. 19411?
  3. I used 5 for P(CoAuthorID|AuthorID), counting only the co-occurrences of AuthorID and CoAuthorID in the whole data. Is that right, or do I have to divide it by some other value?

I understand conditional probability in principle. I have googled it, but the examples I found only involve cards, decks, dice, or coins. This is a real application, which is why I am having trouble applying the concept.

I have tried this:

P(677)          =   1/390   0.0026  (AuthorID/Total AuthorID in 2005)
P(901706)       =   1/1     1.00    (CoAuthorID/Total CoAuthorID for AuthorID in 2005)
P(901706|677)   =   5       5.0     (Total CoAuthorID|AuthorID pairs in data)
P(901706|677)   =   3       3.0     (Total CoAuthorID|AuthorID pairs in data in 2005)
P(901706|677)   =   3/5     0.60    (CoAuthorID|AuthorID pairs in data in 2005/Total CoAuthorID|AuthorID pairs)
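
For comparison, a minimal sketch of one way to make these quantities concrete: estimate every term as a relative frequency over one fixed set of rows (all rows, or only the rows of one year). It assumes each row is one co-authorship event and uses only the nine rows visible above, so the totals 390, 499 and 19411 from the full data do not appear. The function name `bayes_terms` is hypothetical.

```python
# The nine rows visible in the question: (AuthorID, CoAuthorID, Year).
rows = [
    (677, 901706, 2005),
    (677, 901706, 2005),
    (677, 901706, 2005),
    (677, 838459, 2007),
    (677, 901706, 2007),
    (677, 1695352, 2007),
    (677, 901706, 2009),
    (677, 372089, 2011),
    (677, 403400, 2011),
]


def bayes_terms(rows, author, coauthor, year=None):
    """Estimate each term of Bayes' rule as a relative frequency.

    If `year` is given, only rows of that year form the sample;
    otherwise all rows are used. Returns None when a term is undefined.
    """
    sample = [r for r in rows if year is None or r[2] == year]
    n = len(sample)
    n_author = sum(1 for a, _, _ in sample if a == author)
    n_coauthor = sum(1 for _, c, _ in sample if c == coauthor)
    n_pair = sum(1 for a, c, _ in sample if a == author and c == coauthor)
    if n == 0 or n_author == 0 or n_coauthor == 0:
        return None

    p_author = n_author / n                      # P(AuthorID)
    p_coauthor = n_coauthor / n                  # P(CoAuthorID)
    p_coauthor_given_author = n_pair / n_author  # P(CoAuthorID|AuthorID)
    return {
        "p_author": p_author,
        "p_coauthor": p_coauthor,
        "p_coauthor_given_author": p_coauthor_given_author,
        # Bayes' rule; algebraically equal to n_pair / n_coauthor.
        "p_author_given_coauthor": p_coauthor_given_author * p_author / p_coauthor,
    }


if __name__ == "__main__":
    print(bayes_terms(rows, 677, 901706, year=2005))
    print(bayes_terms(rows, 677, 901706))
```

On these nine rows, P(901706|677) comes out as 3/3 = 1 for 2005 and 5/9 over all years. A raw count such as 5 or 3 is only the numerator of that ratio, not a probability by itself.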

Which of these calculations gives the right values to put into the conditional probability formula?

Any help would be appreciated.