First, I'd like to apologize in case the question is too simple or not well explained. Also, I'm not trying to get a specific answer (as I understand the problem is not too well defined), just some information to start my research.
I have a set of data. This set contains different numerical records from people whom we want to split into different groups. Each group aims to be more specific than the previous one and to join people with similar characteristics.
GENERAL DATA
=============
P1 400
P2 355
P3 255
...
PN 650
=============
SAMPLES: 5000
AVG: 322
FILTERED DATA (only some records can be inside this group)
=============
P1 400
P3 255
...
PX 122
=============
SAMPLES: 455
AVG: 245
If we have 5000 samples in our data, one of these segmentations could have just 60. What I'd like to know is the minimum number of samples that we need to be sure that the group represents reality in some way (I mean, that I have enough samples to have some statistical power).
I understand that I have to look for work related to 'Sample size determination', but I don't really get what kind of statistical distribution I'm working with.
What I really need to know is whether the average in the group is representative.
Thanks a lot!
In statistics, if you want to make inferences about characteristics of a population ($N=5000$) based on a smaller sample ($n=455$), two things play a role: the sample size, which is what you are worried about, and the variance of the data. If there is very little variance in the data, then you only need a very small sample to be confident that the sample average is representative of the population. On the other hand, if your data has a lot of variance, then you will need a larger sample.
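To make this trade-off concrete, it can be written as the standard error of the sample mean. Here $s$ denotes the sample standard deviation, a symbol introduced only for this illustration, and the formula assumes the observations were drawn independently and at random:

$$\mathrm{SE}(\bar{x}) = \frac{s}{\sqrt{n}}$$

A larger $s$ (more variance) increases the uncertainty about the mean, while a larger $n$ reduces it, but only at the rate $\sqrt{n}$: to halve the standard error you need roughly four times as many samples.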
This isn't very helpful from a practical point of view, but this is the main trade-off. If you are familiar with significance/confidence levels, then what you might want to do is compute confidence intervals. These give you a lower and an upper bound around your sample mean $\bar{x}=\frac{1}{n}\sum_{i=1}^{n} P_i$, i.e., an interval in which the population mean likely lies, based on the data in your sample. For example, if your sample of $n=455$ has a mean of $\bar{x}=245$, then a 95% confidence interval may be $[245-30,245+30]$ (the actual width depends on the data). The narrower the interval, the more accurately your sample tells you something about the population. The width of the confidence interval depends, as mentioned above, on the sample size (the larger the sample, the narrower the interval) and on the variance of the data (the larger the variance, the wider the interval).
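As a rough illustration of the interval described above, here is a minimal Python sketch. It assumes a normal approximation (reasonable when the group is not too small) and a randomly drawn sample. The function names `mean_confidence_interval` and `required_sample_size` are hypothetical helpers made up for this example, not part of any library.

```python
import math
import statistics


def mean_confidence_interval(values, confidence=0.95):
    """Approximate confidence interval for the mean (normal approximation)."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two observations")
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)  # sample standard deviation
    # Two-sided critical value, about 1.96 for a 95% interval.
    z = statistics.NormalDist().inv_cdf(0.5 + confidence / 2)
    half_width = z * sd / math.sqrt(n)
    return mean - half_width, mean + half_width


def required_sample_size(sd, margin, confidence=0.95):
    """Smallest n whose interval half-width is at most `margin`.

    Assumes `sd` is a reasonable guess of the standard deviation,
    for example taken from the full data set.
    """
    z = statistics.NormalDist().inv_cdf(0.5 + confidence / 2)
    return math.ceil((z * sd / margin) ** 2)
```

You would call `mean_confidence_interval(group_values)` on the records of one group and compare the width of the result with the precision you need. `required_sample_size` turns the same formula around: given a guessed standard deviation and the half-width you can tolerate, it returns the approximate number of samples needed.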
If you are not familiar with confidence intervals, you will have to rely on rules of thumb. A common one is that a sample of about 10% of the population is good, provided that you draw your sample randomly. Random sampling is the most crucial point in this endeavor. Other than that, more is better.