I'm a data scientist by trade, and when dealing with truly "big" time-series data (on the order of TBs), I often have to choose between averaging and random sampling. That is, I either sacrifice time by taking the computationally expensive route of averaging all values across a time subset, or sacrifice representativeness by taking the cheap route of drawing a random sample from that subset.
I took a class on Information Theory back in undergrad, and my intuition is that an average is more representative of a subset than a random sample, but I don't know how to quantify that naïve assumption.
I imagine the "information" obtained by those two methods depends on various properties of the data, e.g. the variance, but I am simply not sure.
Answers would be greatly appreciated! Thanks!
NOTE: I suppose it's important to mention that this isn't a strict dichotomy; rather, I could also average any number N of random samples. I gave the two ends of the spectrum just to simplify the problem somewhat.
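To make that spectrum concrete, here's a rough sketch (all names hypothetical, and assuming the subset behaves like i.i.d. draws, which real time series often don't) of how the spread of the "mean of N random samples" estimator shrinks as N grows — the two extremes in the question are just N = 1 and N = the whole subset:

```python
import random
import statistics

random.seed(0)

# Hypothetical "subset" of a large time series: 100k i.i.d. draws.
subset = [random.gauss(10.0, 3.0) for _ in range(100_000)]

def estimator_spread(n_samples, trials=2_000):
    """Std. dev., over many trials, of the mean-of-n-random-samples estimator."""
    estimates = [
        statistics.fmean(random.choices(subset, k=n_samples))
        for _ in range(trials)
    ]
    return statistics.pstdev(estimates)

# The spread shrinks roughly like 1/sqrt(n), interpolating between
# "one random sample" (n=1) and "average everything" (n=len(subset)).
for n in (1, 10, 100, 1000):
    print(n, round(estimator_spread(n), 3))
```

So for estimating the subset mean specifically, averaging N samples buys you roughly a 1/sqrt(N) reduction in error over a single sample.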
The answer to your question depends deeply upon the task at hand. If you're interested in the average absolute difference between two points randomly selected from ${\cal D}$, or in the range of ${\cal D}$, then the mean is useless; you must use other measures, such as the variance. If instead you want to know the average value, well of course the mean is all you need.
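A toy illustration of that task-dependence (the data sets here are made up for the example): two distributions with essentially identical means but wildly different ranges and pairwise differences, so the mean alone cannot distinguish them:

```python
import random
import statistics

random.seed(1)

# Two hypothetical data sets with the same mean but very different spread.
narrow = [random.gauss(5.0, 0.1) for _ in range(10_000)]
wide   = [random.gauss(5.0, 10.0) for _ in range(10_000)]

for name, data in [("narrow", narrow), ("wide", wide)]:
    # Average absolute difference between two randomly selected points.
    mean_abs_diff = statistics.fmean(
        abs(random.choice(data) - random.choice(data)) for _ in range(5_000)
    )
    print(name,
          round(statistics.fmean(data), 2),   # nearly identical means...
          round(max(data) - min(data), 2),    # ...very different ranges
          round(mean_abs_diff, 2))            # ...and pairwise differences
```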
For your specific question about "information," well this too depends upon the distribution. If you have a delta-function distribution (a single point mass), then the mean tells you everything. If instead you have a complicated mixture of distributions, then the mean tells you almost nothing, and you need many more bits to describe the full distribution.
You should look up sufficient statistics (if the distribution is known): knowing the values of the sufficient statistics of your distribution is (provably) all you will ever need to answer any question about your data set. For example, the sufficient statistics of a one-dimensional Gaussian distribution are its mean and variance. Those two numbers tell you everything you'll ever be able to know about that particular distribution, allowing you to calculate any moment, any expected difference between randomly sampled points, etc. The difference in information (in bits) between knowing just the mean and knowing both the mean and the variance is the number of bits used to specify the variance.
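As a sketch of that Gaussian case (the data here are simulated for illustration): from the two sufficient statistics alone you can predict quantities like the expected absolute difference between two independent draws, which for a Gaussian works out to $2\sigma/\sqrt{\pi}$, and the prediction matches brute force on the data:

```python
import math
import random
import statistics

random.seed(2)

# Hypothetical data assumed to come from a one-dimensional Gaussian.
data = [random.gauss(3.0, 2.0) for _ in range(50_000)]

# The sufficient statistics: everything else is derivable from these two.
mu = statistics.fmean(data)
sigma = statistics.stdev(data)

# Expected |X - Y| for two independent draws: X - Y ~ N(0, 2*sigma^2),
# and E|Z| = s*sqrt(2/pi) for Z ~ N(0, s^2), giving 2*sigma/sqrt(pi).
predicted = 2 * sigma / math.sqrt(math.pi)

# Check the model-based prediction against brute force on the actual data.
empirical = statistics.fmean(
    abs(random.choice(data) - random.choice(data)) for _ in range(20_000)
)
print(round(predicted, 3), round(empirical, 3))
```

The point is that once you trust the Gaussian model, storing (mu, sigma) loses nothing relative to storing the whole subset.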