Background: I am a master's student in stochastic analysis. My course is very theoretical, which in general is fine by me, since that is what I enjoy the most. From the more data-friendly subjects, I have (or will obtain/deepen) knowledge in stochastic processes of all kinds, time series, statistics, and linear regression. Apart from those, my course focuses on stochastic analysis, stochastic differential equations, spatial modelling and point processes.
I can work with R, Mathematica and a bit of Python and Javascript. And of course, one shouldn't forget: Excel.
Motivation: The final impetus that led me to write this question is quite simple: recently I stumbled upon a data-analysis student competition in my city and thought about entering. I quickly realized I have literally zero idea what to do with data. It is not merely that I am not good at it; I literally have no actual knowledge, apart from being able to answer narrow questions in statistics and regression from classes, and possibly model the simplest processes if I try very hard.
But more generally, I simply feel that since I am a mathematician, data analysis skills are a sort of low-hanging fruit, and it would be a great shame not to learn anything about the subject.
Goals: Here lies the problem. While I have a vague idea, my lack of knowledge is such that I do not know what it is I want. I understand this does not make for a well-posed question, but I am hoping that shaping my goals is something Math.SE will help with, too. I think I'd like to stick with R, since it's free software that I'll always be able to use (unlike Mathematica), but software choice is secondary. Vaguely, I'd like to:
1) Have the knowledge necessary to be able to theoretically compete in such a competition (doing badly is fine, but currently I feel like I am considering entering a marathon having only read about "legs" and "running" on Wikipedia)
2) Be able to make the most of my knowledge of mathematics and make it my strength. If I competed against other people, I'd probably be stomped into the ground by even those who have just basic statistics and time series knowledge but are good at working with data. Moreover, if I could somehow incorporate actual stochastics/spatial modelling, that would also be an interesting option.
3) I'll take a stab at guessing my goals - be able to do basic statistics/regression in R, model different processes, do experiments with random variables from different probability distributions, have the basic toolset for time series. Beyond that I'd really be completely guessing.
Questions: 1) What do I study? Are there topics/books considered to be the basics?
2) How can I best utilize my strengths, i.e. a deeper understanding of mathematics? Say I somehow had to compare data using an interesting metric that requires a good deal of knowledge of metric spaces to be understood (yeah, I don't know what I am talking about). Then again, it's very likely that the simpler, the better. I'd simply like to be aware of possible strengths, but I really do want to be able to walk/crawl properly first.
3) The main question: what resources would you recommend for me, i.e. someone who isn't afraid of (or even welcomes) complicated mathematics? That is not to say a simple book may not be far more important, but I am not limited to simple books.
This is primarily a reference-request question (also to make it easier to answer, I suppose), but any answer consisting of general tips and thoughts on this matter will be very welcome, too.
Btw, I wouldn't want to make it sound like a competition is the main motivation for me, as it really isn't. It's just that I think it's a useful benchmark for the "real-life" data skills I learned.
Thanks for any help!
If there is someone at your university or in your area who is an applied statistician, you should begin by having some discussions on these topics with him or her. Someone who knows you personally can give you a level of advice you can't get here.
I'm not sure what you might gain from participating in competitions because they are intentionally beyond the capabilities of most beginners. At the very least, you might get acquainted with some faculty and students who could give you some advice and support.
Statistics is a mathematical science, but parts of applied statistics tend to be more inductive than deductive. You don't have more than the available data to go on, and yet may hope or be expected to use the data to tell you a lot about the populations from which the data were sampled. Finding out how and why the data were collected is often an important first step. Starting with some descriptive statistics and graphics is usually helpful.
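To make "start with some descriptive statistics" concrete, here is a minimal sketch using only Python's standard `statistics` module (Python is used purely for illustration; the same summary is a one-liner in R). The data values are made up, and the names `sample` and `summary` are mine, not from any particular dataset or package.

```python
import statistics

# Hypothetical sample: made-up measurements used only for illustration.
sample = [4.1, 5.0, 5.2, 5.9, 6.3, 6.8, 7.4, 12.5]

# A first numerical look at the data before any modelling.
summary = {
    "n": len(sample),
    "mean": statistics.mean(sample),
    "median": statistics.median(sample),
    "stdev": statistics.stdev(sample),  # sample standard deviation (divides by n - 1)
    "min": min(sample),
    "max": max(sample),
}

for name, value in summary.items():
    print(f"{name:>6}: {value:.3f}")
```

With data like these, a mean noticeably above the median is a hint that something (here the single large value) is pulling the average up, which is exactly the kind of thing worth noticing before running any formal procedure.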
I suggest you start with traditional basic topics: descriptive statistics, one- and two-sample t tests, chi-squared goodness-of-fit tests, one-factor (or one-way) analysis of variance, and simple (one predictor variable) linear regression. Try to get a good idea of when these procedures are appropriate and what they can and cannot tell you about data. Then go on to more advanced topics.
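Since you like the mathematics, it may help to see how little is hidden inside two of these procedures. The following is a rough sketch, not a substitute for a proper statistics package: it computes the Welch two-sample t statistic (with its approximate degrees of freedom) and the least-squares line for one predictor, directly from the textbook formulas. The function names `welch_t` and `simple_regression` and all data values are hypothetical, chosen only for illustration.

```python
import math
import statistics


def welch_t(a, b):
    """Welch two-sample t statistic and Welch-Satterthwaite degrees of freedom."""
    va = statistics.variance(a) / len(a)  # estimated variance of the mean of a
    vb = statistics.variance(b) / len(b)  # estimated variance of the mean of b
    t = (statistics.mean(a) - statistics.mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


def simple_regression(x, y):
    """Least-squares slope and intercept for a single predictor."""
    mx, my = statistics.mean(x), statistics.mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx


# Made-up data for illustration only.
group_a = [5.1, 4.9, 5.6, 5.8, 6.0]
group_b = [4.2, 4.8, 4.5, 5.0, 4.4]
t, df = welch_t(group_a, group_b)
print(f"Welch t = {t:.3f} on about {df:.1f} degrees of freedom")

x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
slope, intercept = simple_regression(x, y)
print(f"fitted line: y = {intercept:.3f} + {slope:.3f} x")
```

Turning the t statistic into a p-value requires the t distribution, which the standard library does not provide; in practice you would let R or another package do that step. The point of the sketch is only that the procedures themselves are short formulas, and the hard part is knowing when they are appropriate.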
Here are some books:
Although not a statistics text, Nate Silver's 'The Signal and the Noise' may give you a good idea about how a serious and productive statistician thinks. (The book discusses attempts at prediction in a number of fields. It is OK to pick just the ones that seem most interesting.) Silver says he is a Bayesian statistician, and there is a Bayesian flavor to much of his work, but I'm not sure to what extent Bayesian statisticians would agree with all of his approaches. I found the book very interesting, but for recreational reading rather than for specific technical information.
Peter Dalgaard has a solid introductory book that uses R to analyze real (mainly biological) data. The 2nd edition has a lot more background about R than the 1st, which may be either good or bad depending on your current level of knowledge about R. (There are other reasonable books about learning statistics with R, but Dalgaard is one of the developers of R and I have found his explanations to be very clear.)
Ramsey & Schafer have a nice book, 'The Statistical Sleuth', about basic statistical applications. It uses no particular software package. It is interesting and clear enough to work well for self-study. The mathematical level is low, but the book gives you a good sense of what serious applied statistics looks like.
Finally, these days there is a lot of press about 'data science' and 'big data'. These fields are new and might be difficult to navigate on your own. Typically, there is a huge amount of data and relatively little information about its quality and how it was collected. At this point, a lot of the ideas come more from computer science than from mathematics or statistics. Standards for what constitutes a useful result are still under development. These are currently chaotic fields, but with great potential for the future.