I have about 3.5GB worth of data that I need to get statistics from, split across files of about 71MB each, with an (approximately) equal number of samples per file. From this data, I would like to gather the mean and standard deviation. Parsing the entire dataset at once is probably a bad idea, since it's 3.5GB.
However, I know that for the mean I can at least (with some accuracy, since the files are approximately the same size) take the average of each file, and then take the average of those sub-averages.
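To illustrate what I mean (with made-up numbers standing in for the real files): the plain average of sub-averages is only approximate when file sizes differ, but weighting each sub-average by its sample count gives the exact overall mean.

```python
import numpy as np

# Hypothetical example: three "files" of slightly different sizes.
files = [np.array([1.0, 2.0, 3.0]),
         np.array([4.0, 5.0]),
         np.array([6.0, 7.0, 8.0, 9.0])]

counts = np.array([len(f) for f in files])
means = np.array([f.mean() for f in files])

# Unweighted average of sub-averages: only approximate when sizes differ.
naive_mean = means.mean()

# Weighting each sub-average by its sample count recovers the exact mean.
exact_mean = np.average(means, weights=counts)

all_data = np.concatenate(files)
print(naive_mean, exact_mean, all_data.mean())
```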
The standard deviation is a little more tricky, though. I've taken some time to run tests and found that the standard deviation of a large sample seems to be approximately equal to the average of the standard deviations of equally sized smaller chunks of samples. Does this actually hold in general, or was that just a coincidence in the few tests I've run? If it does hold, can I calculate what my percent error is likely to be? Finally, is there a more accurate way to estimate the standard deviation that doesn't require me mulling over 3.5GB of data at a time?
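A sketch of the kind of test described above, using synthetic normal data in place of the real files: it compares the full-sample standard deviation against the average of per-chunk standard deviations, and also shows (as a well-known identity, not anything specific to this dataset) that the exact value can be recovered from just a per-file count, sum, and sum of squares.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=10.0, scale=2.0, size=100_000)
chunks = np.split(data, 100)  # 100 equal-sized "files"

# Full-sample standard deviation vs. average of per-chunk standard deviations.
full_std = data.std(ddof=1)
avg_chunk_std = np.mean([c.std(ddof=1) for c in chunks])

# Exact reconstruction from per-chunk count, sum, and sum of squares:
# var = (sum(x^2) - sum(x)^2 / n) / (n - 1), so no chunk needs to see the others.
n = sum(len(c) for c in chunks)
s1 = sum(c.sum() for c in chunks)
s2 = sum((c ** 2).sum() for c in chunks)
exact_std = np.sqrt((s2 - s1 ** 2 / n) / (n - 1))

print(full_std, avg_chunk_std, exact_std)
```

(The sum-of-squares form can lose precision through cancellation when the mean is large relative to the spread, which is worth keeping in mind at this data size.)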
You have a sampling problem. I would treat your large dataset as a population, and then sample from it. First, take a random sample of the data and check that it is normally distributed and free of outliers. Then compute the sample standard deviation along with a confidence interval based on the chi-square distribution:
$$\sqrt{\frac{(n-1)s^2}{\chi^2\left(df,\frac{\alpha}{2}\right)}} < \sigma < \sqrt{\frac{(n-1)s^2}{\chi^2\left(df,1-\frac{\alpha}{2}\right)}}$$
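A sketch of computing this interval with SciPy, using synthetic data in place of your random sample. Note that the formula above uses upper-tail critical values, while `scipy.stats.chi2.ppf` takes a lower-tail probability, so $\chi^2(df, \alpha/2)$ maps to `chi2.ppf(1 - alpha/2, df)` and vice versa.

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(1)
sample = rng.normal(loc=0.0, scale=3.0, size=500)  # stand-in for your random sample

n = len(sample)
df = n - 1
s2 = sample.var(ddof=1)  # sample variance
alpha = 0.05             # 95% confidence interval

# Lower bound uses the upper-tail critical value chi^2(df, alpha/2),
# which is chi2.ppf(1 - alpha/2, df) in SciPy's lower-tail convention.
lower = np.sqrt(df * s2 / chi2.ppf(1 - alpha / 2, df))
upper = np.sqrt(df * s2 / chi2.ppf(alpha / 2, df))
print(lower, upper)
```

The sample standard deviation itself always falls inside this interval; the interval's width tells you how much the population standard deviation could plausibly differ from it.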
Bear in mind that there will always be a margin of error unless you process the entire population, which seems impractical in your case.