How to use Bayesian Inference for a large set of data?


I have a large data set and need a way to quantify correlation. I am thinking of using Bayesian inference to tackle the problem.

The question in mind is to see how the attendance of new students in their first month of university affects their success. So the statement that I would like to reach is: "Given that new students attend all classes in their first month of university, there is a 50% chance that they will finish their degree with an A."

Could anyone point me in the right direction, please? The literature is overwhelming, and I need to do the data analysis in Python.

Thanks!

There are 3 answers below.

Answer 1:

Welcome to MSE! When you have a large data set and want to apply Bayesian techniques, the first step is to learn the structure of a Bayesian network. Here you can use Pearson's chi-squared tests or heuristic methods (such as hill climbing) to configure the network. The next step is to learn/estimate the conditional probabilities described by the network. Then you are in a position to make inferences: typically some nodes/random variables of the network are instantiated (i.e., have observed values), and you apply an inference algorithm to find the most probable states of the hidden/uninstantiated random variables.
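As a minimal sketch of the structure-learning step, here is Pearson's chi-squared test of independence between two binary variables, computed from scratch on made-up counts — the kind of test used to decide whether two variables should share an edge in the network:

```python
# Minimal sketch: Pearson's chi-squared test of independence between
# two binary variables. The counts below are hypothetical.

def chi_squared(table):
    """table[i][j] = observed count for (X=i, Y=j)."""
    rows = [sum(r) for r in table]            # row marginals
    cols = [sum(c) for c in zip(*table)]      # column marginals
    total = sum(rows)
    stat = 0.0
    for i, r in enumerate(rows):
        for j, c in enumerate(cols):
            expected = r * c / total          # count expected under independence
            stat += (table[i][j] - expected) ** 2 / expected
    return stat

# Hypothetical counts: rows = attended all classes (yes/no),
# columns = got an A (yes/no).
stat = chi_squared([[30, 20], [10, 40]])
print(stat)  # ~16.67; compare with the chi^2 critical value, df = 1 (~3.84 at 5%)
```

A large statistic suggests the two variables are dependent, so an edge between them is warranted.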

One of the most prominent examples is the hidden Markov model (HMM), which has a given network structure. Learning/estimation of the conditional probabilities can be done by the expectation-maximization (EM) algorithm (or more efficiently by the Baum-Welch algorithm), and inference by the famous Viterbi algorithm.
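To make the inference step concrete, here is a Viterbi sketch for a toy two-state HMM. All names and probabilities are made up for illustration: the hidden state is whether a student is "engaged", the observation is whether they were present in class.

```python
# Viterbi: most probable hidden-state sequence for a toy HMM.
# All states, observations, and probabilities below are hypothetical.

def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return the most probable hidden-state sequence for `obs`."""
    V = [{s: start_p[s] * emit_p[s][obs[0]] for s in states}]
    path = {s: [s] for s in states}
    for o in obs[1:]:
        V.append({})
        new_path = {}
        for s in states:
            # Best predecessor for state s at this step.
            prob, prev = max((V[-2][p] * trans_p[p][s] * emit_p[s][o], p)
                             for p in states)
            V[-1][s] = prob
            new_path[s] = path[prev] + [s]
        path = new_path
    best = max(states, key=lambda s: V[-1][s])
    return path[best]

states = ('engaged', 'disengaged')
start_p = {'engaged': 0.6, 'disengaged': 0.4}
trans_p = {'engaged':    {'engaged': 0.8, 'disengaged': 0.2},
           'disengaged': {'engaged': 0.3, 'disengaged': 0.7}}
emit_p = {'engaged':    {'present': 0.9, 'absent': 0.1},
          'disengaged': {'present': 0.2, 'absent': 0.8}}
best_path = viterbi(['present', 'present', 'absent'],
                    states, start_p, trans_p, emit_p)
print(best_path)  # ['engaged', 'engaged', 'disengaged']
```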

Answer 2:

If the case you describe is actually the one you want to solve, it appears that you want something simple, namely $$ \mathbb P(\text{grade $A$} \mid \text{attend all classes month 1}). $$ I will now abbreviate this as $\mathbb P(A \mid C)$. Note that by definition, $$ \mathbb P(A \mid C) = \frac{\mathbb P(A \land C)}{\mathbb P(C)}, $$ and this is something you can estimate directly from the data: it is just $$ \frac{\text{number of students who got an $A$ and attended all classes month 1}}{\text{number of students who attended all classes month 1}}. $$ If you must apply Bayes' rule, you get $$ \mathbb P(A \mid C) = \mathbb P(C \mid A) \times \frac{\mathbb P(A)}{\mathbb P(C)}. $$ You can also compute all the quantities on the right separately from your data, but that is just redundant work.
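In Python, this direct estimate is a couple of lines. The records below are hypothetical, and the field layout (attendance flag, final grade) is just one possible shape for your data:

```python
# Direct empirical estimate of P(A | C) from student records.
# Each record: (attended_all_month1, final_grade) — data is made up.
records = [
    (True, 'A'), (True, 'B'), (True, 'A'), (False, 'C'),
    (False, 'A'), (True, 'A'), (False, 'B'), (True, 'C'),
]

# Restrict to students who attended all classes in month 1 (the event C),
# then count the fraction who got an A.
attended = [grade for att, grade in records if att]
p_a_given_c = sum(g == 'A' for g in attended) / len(attended)
print(p_a_given_c)  # 3 A's among 5 attenders -> 0.6
```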

Answer 3:

For Bayesian data analysis, you can have a look at Stan. This is a well-documented and well-tested framework for doing Bayesian analysis; see its case studies.

Once you have defined your model, you can fit it using likelihood maximization (not really Bayesian, as this is a pointwise estimate), Hamiltonian Monte Carlo, or variational inference (a good choice if your model has many parameters).
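For the single proportion in the question, though, full Bayesian inference has a closed form and needs no sampler at all. As a hedged sketch (the prior choice and counts below are made up): with a Beta(a, b) prior on $\mathbb P(A \mid C)$ and $k$ A-grades among $n$ full attenders, the posterior is Beta(a + k, b + n - k):

```python
# Conjugate Beta-Binomial sketch for P(A | C); all numbers hypothetical.
a, b = 1.0, 1.0          # uniform Beta(1, 1) prior
k, n = 30, 50            # 30 A's among 50 students who attended everything

# Posterior is Beta(a + k, b + n - k); its mean is a point summary.
post_a, post_b = a + k, b + n - k
posterior_mean = post_a / (post_a + post_b)
print(posterior_mean)    # (1 + 30) / (2 + 50) = 31/52 ≈ 0.596
```

Stan earns its keep once the model grows beyond a single proportion (hierarchies over courses, covariates, etc.).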

There are wrappers for many languages (R, Python, etc.); see the Stan interfaces page.