Sorry for the probably silly question; it's the first time I've needed to do this kind of work. I have a large data set regarding clicks on some element on a web page. It contains some characteristics of the element (size, position, with image or not, etc.) and whether or not it was clicked, something like this: Answer_Position Size Image_Present OperatingSystem WasClick
I'm using a Naïve Bayes classifier to predict the probability of a click when I change the size, position, etc. But I have two questions and don't know where to dig:

- How do I determine whether there is any dependency at all? Maybe click/no click does not depend on any of those factors?
- How do I find the most important factors? For example, maybe the user's operating system has no effect on whether they click, etc.
Can you please point me to algorithms that can do that?
Thank you!
Your dependent variable is the 0/1 dummy WasClick. You can use one of the popular discrete choice models (logit/probit), or simply run an OLS regression (a linear probability model) on your data. As independent variables, you use your predictors. An OLS regression might look like this: $$WasClick_i=\beta_1 OSWindows_i+\beta_2 OSMAC_i+\beta_3 OSLinux_i+\beta_4 OSother_i+\beta_5 ImagePresent_i+\beta_6 Size_i+\beta X_i+\epsilon_i,$$ where each observation $i$ is a user–element pair. You should probably account for the correlation between clicks on different elements by the same user (e.g., cluster standard errors by user).
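As a minimal sketch, the OLS (linear probability model) version needs nothing beyond NumPy. The data below is synthetic and the column choices (size, an image dummy, a single Mac dummy) are just stand-ins for your own predictors:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Synthetic stand-in data (hypothetical): replace with your own columns.
size = rng.uniform(50.0, 300.0, n)        # element size in pixels
image_present = rng.integers(0, 2, n)     # 1 if the element has an image
os_mac = rng.integers(0, 2, n)            # 1 if the user is on a Mac

# Simulated truth: size and image matter, the operating system does not.
index = -4.0 + 0.02 * size + 0.8 * image_present
click_prob = 1.0 / (1.0 + np.exp(-index))
was_click = (rng.random(n) < click_prob).astype(float)

# OLS / linear probability model: WasClick_i = beta' x_i + eps_i
X = np.column_stack([np.ones(n), size, image_present, os_mac])
beta, *_ = np.linalg.lstsq(X, was_click, rcond=None)
print(beta)  # [intercept, size, image_present, os_mac]
```

With this setup, the estimated coefficients on size and the image dummy come out positive, while the coefficient on the irrelevant OS dummy lands close to zero.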
With the estimated model, you can see how strong the predicted impact of each factor is. Basically, with your estimates you can compute the conditional expectation $E[WasClick_i|X_i]$, where $X_i$ collects all your predictors, assuming the model specification (logit/probit/OLS) is correct. With this, you could find out whether a Mac user on average clicks more often than a Windows user, or whether a larger element is clicked more often.
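To illustrate turning estimates into predicted probabilities: under a logit model, $E[WasClick_i|X_i]$ is the logistic function of the linear index. The coefficient values below are made up purely for demonstration:

```python
import math

# Hypothetical estimated logit coefficients: [intercept, os_mac, size]
beta_hat = [-1.0, 0.5, 0.01]

def predicted_click_prob(os_mac, size):
    """E[WasClick | X] under a logit model: sigmoid of the linear index."""
    index = beta_hat[0] + beta_hat[1] * os_mac + beta_hat[2] * size
    return 1.0 / (1.0 + math.exp(-index))

p_mac = predicted_click_prob(os_mac=1, size=150)      # index = 1.0
p_windows = predicted_click_prob(os_mac=0, size=150)  # index = 0.5
print(p_mac, p_windows)  # roughly 0.73 vs 0.62
```

Comparing two such predictions at otherwise identical predictor values is exactly the "does a Mac user click more often than a Windows user" question.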
Once you have estimated the model, you can also test each individual factor (e.g., Image_Present) for significance. Significance tells you whether you can be confident that the factor has a nonzero impact on the outcome. Maybe the factor has an impact, but your data is too noisy for you to conclude with confidence that it does. Or it has no impact, but by chance it looks as if it does. That is why you typically test for significance. When a factor is estimated to be significantly different from zero, one typically says "there is an impact".
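A minimal sketch of such a test, again on synthetic data: the classical OLS t-statistic for each coefficient. (In practice you would use clustered standard errors and an estimation library rather than rolling your own.)

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Synthetic data (hypothetical): the OS dummy has no true effect.
size = rng.uniform(50.0, 300.0, n)
image_present = rng.integers(0, 2, n)
os_mac = rng.integers(0, 2, n)
p = 1.0 / (1.0 + np.exp(-(-4.0 + 0.02 * size + 0.8 * image_present)))
was_click = (rng.random(n) < p).astype(float)

X = np.column_stack([np.ones(n), size, image_present, os_mac])
beta, *_ = np.linalg.lstsq(X, was_click, rcond=None)

# Classical OLS standard errors and t-statistics.
resid = was_click - X @ beta
sigma2 = resid @ resid / (n - X.shape[1])
cov = sigma2 * np.linalg.inv(X.T @ X)
t_stats = beta / np.sqrt(np.diag(cov))
print(t_stats)  # |t| > 1.96 is roughly significant at the 5% level
```

Here size and the image dummy show large |t| values, while the irrelevant OS dummy typically falls below the 1.96 cutoff, which is exactly the "noisy data vs. real impact" distinction above.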
If you do this, and you want to do it right rather than quick and dirty, you should read up on discrete choice models and hypothesis testing first.