In a previous question, which was nicely answered (Minimum number of samples), I asked what the minimum number of samples is that provides some statistical power.
I have a huge dataset of numerical values, each with some features attached. I filter these values by the features they carry; as I apply more filters and tighten the conditions, the number of samples shrinks (F denotes a feature):
n = 50,000
Filter those that have F1
n = 20,000
Filter those that have F2
...
n = 50
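A minimal sketch of this successive-filtering process, assuming a toy dataset; the feature names `F1`, `F2`, the probabilities, and the value distribution are all illustrative placeholders, not your actual data:

```python
import random

random.seed(0)

# Toy dataset: each record has a numeric value and some boolean features.
records = [
    {"value": random.gauss(10, 2),
     "F1": random.random() < 0.4,
     "F2": random.random() < 0.3}
    for _ in range(50_000)
]

# Apply the filters one after another; each step shrinks the sample.
subset = records
for feature in ("F1", "F2"):
    subset = [r for r in subset if r[feature]]
    print(f"after filtering on {feature}: {len(subset)} records remain")
```

Each pass keeps only the records carrying the current feature, so the final `subset` plays the role of your n = 50 set.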
The thing is that I would like to know how meaningful the last set (the one produced by applying all the filters) is. I know its mean and standard deviation.
In my previous question answered by @Nameless he wrote:
"The smaller (narrower) the interval, the more accurately your sample tells you something about the population."
I would like to know where I can find more background on that assertion. If I have two final sets, does less dispersion in one of them mean that the filtering is more meaningful because the records are more closely related?
Thanks a lot!
In general, I would read about confidence intervals and how they bracket your desired parameter.
However, as Nameless wrote in the previous answer, confidence intervals assume you have taken a random sample from your population. If some of your filters are numeric, then you have invalidated the random-sampling criterion. If they are strictly features that do not, a priori, restrict the range of your data, then you should be OK.
If the above holds, then a small interval means that the estimate you produce will, statistically, be close to the true value. The measure of this closeness is given by the interval. For example, if you calculate a sample mean of, say, 10, and your 95% CI is (0, 100), that says the range of plausible values for the true mean, given your sample mean and sample size, is quite large. On the other hand, if the interval were (9.5, 10.5), then your estimator is quite good and there is only a small range of plausible values for the true mean.
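To make the narrowing concrete, here is a small sketch of how the 95% CI for a mean shrinks as the sample size grows. It uses the normal approximation (the 1.96 multiplier is the standard z value); the population mean and standard deviation are made-up numbers for illustration:

```python
import math
import random

random.seed(1)
population_mean, population_sd = 10.0, 5.0

widths = {}
for n in (50, 500, 5000):
    # Draw a sample of size n from the (hypothetical) population.
    sample = [random.gauss(population_mean, population_sd) for _ in range(n)]
    mean = sum(sample) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in sample) / (n - 1))
    # Half-width of the 95% CI under the normal approximation.
    widths[n] = 1.96 * sd / math.sqrt(n)
    print(f"n={n:5d}  mean={mean:6.2f}  "
          f"95% CI=({mean - widths[n]:.2f}, {mean + widths[n]:.2f})")
```

The half-width scales like sd / sqrt(n), so multiplying n by 100 divides the interval width by about 10: the same precision gain you give up when your filters cut n = 50,000 down to n = 50.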
As for your statement that less dispersion = more related, that would depend on the relationship between the numerical value and the attributes. I would think that if a set of objects share ALL features, they would be more related regardless of the numerical value or the dispersion, unless that numerical value is, in fact, some form of aggregate classification measure that locates objects along the number line.