I have a large data set and need a way to quantify correlation. I am thinking of using Bayesian inference to tackle the problem.
The question in mind is how the attendance of new students at classes in their first month of university affects their later success. The kind of statement I would like to reach is: "Given that new students attend all classes in their first month of university, there is a 50% chance that they will finish their degree with an A."
Could anyone point me in the right direction, please? The literature is overwhelming, plus I need to do the data analysis in Python.
Thanks!
Welcome to MSE! When you have a large data set and want to apply Bayesian techniques, the first step is to learn the structure, i.e., to establish a Bayesian network. Here you can use Pearson's chi-squared tests or heuristic methods (such as hill climbing) to configure such a network. The next step is to learn/estimate the conditional probabilities described by the network. Then you are in a position to make inferences. For this, one usually has some given data (i.e., some nodes/random variables of the network are instantiated/have values). You can then apply an inference algorithm to find the most probable states of the hidden/uninstantiated random variables.
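For the second step (estimating conditional probabilities), a minimal sketch in plain Python might look like the following. The records are hypothetical stand-ins for your real student data, and the estimator is just the maximum-likelihood count ratio for a single parent-child pair (attendance → grade); a full Bayesian network would estimate one such table per node:

```python
from collections import Counter

# Hypothetical records: (attended_all_first_month, final_grade).
# These numbers are made up for illustration only.
records = [
    (True, "A"), (True, "A"), (True, "B"), (True, "A"),
    (False, "B"), (False, "C"), (True, "B"), (False, "A"),
]

def conditional_prob(records, grade="A", attended=True):
    """Maximum-likelihood estimate of P(grade | attendance) from counts."""
    matching = [g for a, g in records if a == attended]
    if not matching:
        return 0.0
    return Counter(matching)[grade] / len(matching)

p = conditional_prob(records)
print(f"P(grade=A | attended all classes) = {p:.2f}")  # 0.60 for this toy data
```

Libraries such as pgmpy automate all three steps (structure search, parameter estimation, and inference) for larger networks, so you rarely need to hand-roll this beyond understanding what is being computed.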
One of the most prominent examples is the hidden Markov model (HMM), which has a fixed network structure. Learning/estimation of the conditional probabilities can be done by the expectation-maximization (EM) algorithm, which for HMMs specializes to the Baum-Welch algorithm, and inference by the famous Viterbi algorithm.
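To make the inference step concrete, here is a short pure-Python Viterbi sketch. The model (rainy/sunny states, walk/shop/clean observations) is the standard textbook toy example with made-up probabilities, not anything derived from your data:

```python
def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return the most probable hidden-state path and its probability."""
    # V[t][s]: probability of the best path ending in state s at time t.
    V = [{s: start_p[s] * emit_p[s][obs[0]] for s in states}]
    path = {s: [s] for s in states}
    for t in range(1, len(obs)):
        V.append({})
        new_path = {}
        for s in states:
            # Best predecessor for state s at time t.
            prob, prev = max(
                (V[t - 1][p] * trans_p[p][s] * emit_p[s][obs[t]], p)
                for p in states
            )
            V[t][s] = prob
            new_path[s] = path[prev] + [s]
        path = new_path
    best_prob, best_state = max((V[-1][s], s) for s in states)
    return path[best_state], best_prob

# Toy weather HMM with illustrative (made-up) probabilities.
states = ("Rainy", "Sunny")
start_p = {"Rainy": 0.6, "Sunny": 0.4}
trans_p = {"Rainy": {"Rainy": 0.7, "Sunny": 0.3},
           "Sunny": {"Rainy": 0.4, "Sunny": 0.6}}
emit_p = {"Rainy": {"walk": 0.1, "shop": 0.4, "clean": 0.5},
          "Sunny": {"walk": 0.6, "shop": 0.3, "clean": 0.1}}

best_path, best_prob = viterbi(("walk", "shop", "clean"),
                               states, start_p, trans_p, emit_p)
print(best_path, best_prob)  # ['Sunny', 'Rainy', 'Rainy'] 0.01344
```

This multiplies raw probabilities for clarity; on long sequences you would work with log probabilities to avoid underflow.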