Averages and sample sizes
My brain seems to not get statistics and probability. So I was wondering how averaging things out is influenced by the number of items being averaged, and in particular, how the underlying distribution changes things. Here are some stories to illustrate the issue:
Minting
You’re a Dutch banker working for the famous Wisselbank. Your job is to check whether the coins coming in are good. You know that each coin should weigh 25g ± 0.5g (I’m taking a lot of liberties with this story). Your method is to count out 100 coins, and see if they weigh 2.5kg. You just got a batch that weighs 2480g, which is 20g less than expected. But since each coin is ± 0.5g, and you have 100 coins, that should be ± 50g, right?
Network errors
You’re on monitoring duty at work. Previous tests showed that something like 2 requests out of every 1000 caused errors for random reasons (e.g. Russian hackers are probing your servers, or a user clicked too fast) and so can be safely ignored (there’s a task to fix that, but it keeps being left for the next sprint). You just got an alarm in the middle of the night - over the last hour, there were 6840 network errors, at a request rate of 1k/s. This is 0.19% of all the requests, while the known error rate was 0.2%. Should you go back to sleep and have a look at it in the morning, or is this a real issue?
Party!!
You like organizing parties for your friends. Usually, when you invite 100 people, some people don’t show up, while others bring friends, so you assume that there will be between 90 - 110 people. You want to organize a weekly meeting for the next year, but don’t know how much swag to order. Assuming 100 people have signed up, and you have a very good deal lined up to get all the resources at once, how much should you order?
Bird watching
You’re an ornithologist that is studying the lesser spotted woodpecker (or any other species). You want to work out how many of them can be supported by a hectare of woodland. You’re going to visit various patches of forest (or maybe send your friends?) and count how many birds could be found during each trip. You know there is a magical way of calculating this, called hierarchical mixed models or something, but the only guy who actually understands them is currently busy with his habilitation and also has a large backlog of other people’s studies, so it should be enough to just take the average, right? If you check a bunch of places, model it as a Poisson distribution, and take the average, that should be fine, hopefully?
Differing distributions
If you have a distribution with a large spread, will that make the resulting averages more spread out too? These stories all have different distributions (respectively: normal, binomial, uniform and Poisson), so my initial assumption was that they would behave differently when averaged. I guessed that normal distributions should lend themselves well to averaging, but that the others might need some extra mangling or something.
I’d read somewhere long ago (in a very interesting article that I unfortunately can’t find) that the mint story was a real deal in some bank or other, where they’d use this method to check coins. I vaguely recalled that the moral was something like “the larger the sample size, the smaller the expected variance”, but this would have to be checked.
If I had known enough statistics, I’d have realised that this was just a question about the central limit theorem. So seeing as I’m a much better programmer than I am a statistician, I did it the hard way, and started out by writing a simulator to test out various situations. The mint story (i.e. a normal distribution) was simple to get my head around - just generate a list of 100 items, add some random perturbation to each of them and see if the total was within ± 50g of the expected 2.5kg. This worked fine, and confirmed the central limit theorem - larger batches resulted in the average being a lot closer to 25g.
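A minimal sketch of that first check might look something like this (Python with numpy assumed; treating the ± 0.5g spec as a standard deviation of 0.5g is my own simplification):

```python
import numpy as np

rng = np.random.default_rng(0)

# One weighing: 100 coins, nominally 25g each, plus some random perturbation.
# Assumption: the +/- 0.5g spec is modeled here as a standard deviation of 0.5g.
coins = rng.normal(loc=25.0, scale=0.5, size=100)
total = coins.sum()
print(f"total: {total:.1f}g, within +/- 50g of 2.5kg: {abs(total - 2500) <= 50}")
```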
This is where things started getting tricky. I had assumed that everything is normal, and was trying to model the network errors story as normally distributed. It’s quite obviously (in retrospect) not. Which leads to one of my insights - always make sure you know what distribution to use for modeling things. This is something I knew. I’d even explained to people that this is important. And yet.
Simulator
Here is the simulator. It works by running n trials with the given batch size, and then displays a graph of how often each value of the chosen statistic (sum, average, standard deviation or variance) occurred across the runs. Each batch is constructed by sampling batch size items from the selected distribution and then calculating the statistics for it. This allows playing about with various parameters to see how they actually work out. Check out the previous stories and see what happens.
[Interactive simulator: simulation options, batch options, and a results graph of the chosen statistic.]
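The real simulator is interactive, but a rough non-interactive sketch of the same idea might look like this (Python with numpy and matplotlib assumed; the function and parameter names are mine):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)

def simulate(sample, n_trials=1000, batch_size=100, statistic=np.mean):
    """Draw n_trials batches of batch_size items and compute one statistic per batch."""
    batches = sample(size=(n_trials, batch_size))
    return statistic(batches, axis=1)

# e.g. the minting story: coin weights ~ Normal(25, 1), averaged per batch of 100
means = simulate(lambda size: rng.normal(25, 1, size=size))

plt.hist(means, bins=40)
plt.xlabel("batch average")
plt.ylabel("how often it occurred")
plt.show()
```

Swapping in rng.binomial, rng.integers or rng.poisson (and np.sum, np.std or np.var for the statistic) covers the other stories.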
Central limit theorem
The central limit theorem states that if you have independent and identically distributed random variables, then the sampling distribution of a sample mean is approximately normal, if the sample size is large enough. Notice that this doesn’t say anything about the underlying distribution, only about the sample mean - if you take the average of a (large enough) sample from any (with the appropriate caveats) distribution, you get a bell curve. This is huge! And pretty obvious, as it pops up everywhere… It’s basically the idea behind replicating studies, meta-studies etc. - if you take 100 nutrition studies about whether tomatoes cause cancer and average them out, you’ll get something near the real value, even if the original studies were all over the place. With the appropriate caveats, of course.
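To make the “any distribution” part concrete, here is a quick check (a sketch, numpy assumed) using an exponential distribution, which is heavily skewed and looks nothing like a bell curve, yet whose sample means behave just as the CLT predicts:

```python
import numpy as np

rng = np.random.default_rng(1)

# Exponential with mean 1 (and standard deviation 1) - strongly skewed.
n = 100
means = rng.exponential(scale=1.0, size=(10_000, n)).mean(axis=1)

# The CLT says the means are roughly Normal(1, 1/sqrt(n)),
# so about 68% of them should fall within one such std dev of 1.
sigma = 1 / np.sqrt(n)
print(f"fraction within one sigma: {np.mean(np.abs(means - 1) < sigma):.2f}")  # ~0.68
```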
This much I pretty much knew. Maybe not explicitly, but it’s hard to spend much time around Science™ and not realise that taking the average of things is a powerful technique. Especially when you find out how the law of large numbers works (of which the CLT is just an extension). The really interesting thing, though, is how it works relative to the sample size.
The CLT says that $\overline {X_n} = \frac {\sum_i X_i} {n}$ (i.e. the sample mean) has a distribution that is approximately Normal with mean $\mu$ and variance $\frac {{\sigma}^2} {n}$, where $ \sigma ^ 2$ is the variance of the underlying distribution. This means that the standard deviation is $\frac {\sigma} {\sqrt n}$. Notice the square root there - this is crucial. The standard deviation of a sample mean is inversely proportional to the square root of the sample size. Which implies quickly diminishing returns, what with this being nonlinear. The moral here being that bigger might well be better, but not by all that much - you get a lot more juice per extra data point going from 10 -> 100 items per average than from going 100 -> 1000. Reducing the error by a factor of $n$ requires $n^2$ times more data points.
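The scaling is easy to see numerically; a small sketch (numpy assumed, using a standard normal so that $\sigma = 1$):

```python
import numpy as np

rng = np.random.default_rng(2)

# Standard deviation of the sample mean vs. sample size, underlying Normal(0, 1).
for n in (10, 100, 1000):
    means = rng.normal(0, 1, size=(5000, n)).mean(axis=1)
    print(f"n={n:>4}: std of means = {means.std():.4f}, sigma/sqrt(n) = {1/np.sqrt(n):.4f}")
```

Each tenfold increase in the sample size only shrinks the spread by a factor of $\sqrt{10} \approx 3.2$.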
CLT in practice
How does this apply to the examples from the beginning? Let’s give it a whirl:
Minting
This is a case where the underlying random variable (the weight of the coin) can be modeled as a Normal distribution with a mean of 25g and a variance of 1, i.e. a standard deviation of 1g (to treat the ± 0.5g from the story as more of a hard limit, something like a variance of 0.05 would have to be used instead).
Running 1000 simulations of batches of 100 items gives a std dev of 0.1. Which makes sense, as $\frac 1 {\sqrt{100}} = 0.1$. This is reflected in the graph, where the average weights are between 24.7 - 25.3g. This should totally suffice to catch your unsophisticated cheater! Using batches 10 times larger only gives ~3 times smaller std devs, which is good, but would involve a lot more work.
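Those numbers are easy to reproduce (a sketch, numpy assumed):

```python
import numpy as np

rng = np.random.default_rng(3)

# 1000 batches of 100 coins, each coin ~ Normal(mean=25, variance=1).
means = rng.normal(25, 1, size=(1000, 100)).mean(axis=1)

print(f"std of batch averages: {means.std():.3f}")                        # ~0.1
print(f"range of batch averages: {means.min():.2f} - {means.max():.2f}")  # ~24.7 - 25.3
```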
Network errors
Network errors (assuming they’re independent - heh!) can be modeled as a Binomial distribution, i.e. n trials, each of which has probability p of “succeeding” (which here, somewhat perversely, means erroring out). In this case, with 1k requests/s and a 0.2% error rate, that’s Binomial(1000, 0.002) per second. Let’s assume that metrics are collected per minute, which gives us batches of 60s * 1k requests, which is 60 000 trials per batch. There are 60 minutes in each hour, so in total that should be ~3 600 000 requests.
Out of that, 0.2% should be errors, so we expect to get 7200 errors, or 3,592,800 successful requests. In the example, there were 6840 errors. Is that ok?
Running 1000 simulations gives on average around 3,592,800 successful requests per hour, which matches the calculations (60k is a massive sample, after all). That being said, 6840 errors were pretty much never generated in any of the 1000 trials - there were always more. So even though 0.19% looks pretty much the same as 0.2%, with such a large sample size it isn’t. Does this also check out in theory? The empirical standard deviation of the per-hour error counts was 82.25. This means that around 68% of the hourly error counts should be 7200 ± 82, and around 95% should be 7200 ± 164. This is a very narrow band, which is what you get from such a large sample size. It also checks out if one does the explicit calculations. The variance of a Binomial distribution is $ \sigma ^2 = k \cdot p \cdot (1 - p)$, where $k$ is the number of trials and $p$ the per-trial probability. So in this case $ \sigma ^2 = 60,000 \cdot (1 - 0.002) \cdot 0.002 = 119.76 $ and $ \sigma = 10.94 $. The standard deviation of the mean over 60 batches should then be $ \frac {10.94} {\sqrt {60}} = 1.41 $. Multiplying this by 60 (the number of batches) transforms it from the std dev of the averages into the std dev of the counts, which checks out, as $ 1.41 \cdot 60 \approx 84.77 $ - close enough to the empirical result.
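A sketch of the corresponding simulation (numpy assumed; the seed and variable names are mine):

```python
import numpy as np

rng = np.random.default_rng(4)

# One hour = 60 one-minute batches of 60,000 requests, each erroring with p = 0.002.
# Simulate 1000 such hours and look at the per-hour error counts.
errors_per_minute = rng.binomial(n=60_000, p=0.002, size=(1000, 60))
errors_per_hour = errors_per_minute.sum(axis=1)

print(f"mean errors per hour: {errors_per_hour.mean():.0f}")   # ~7200
print(f"std of errors per hour: {errors_per_hour.std():.1f}")  # ~85
print(f"hours with <= 6840 errors: {(errors_per_hour <= 6840).sum()} out of 1000")

# How far away is the observed 6840?
z = (6840 - errors_per_hour.mean()) / errors_per_hour.std()
print(f"z-score of 6840: {z:.1f}")  # a bit more than 4 std devs below the mean
```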
There were 6840 errors in the example, which is 360 fewer than expected, and 4.4 standard deviations away from the 0.2% error rate mean. This means that something fishy is going on, but in a positive way? As there are fewer errors than expected? Interesting… It can probably wait till tomorrow, though. This is fine…
Party
This one will assume a (discrete) uniform distribution, i.e. any value between 90-110 is equally likely. A Poisson or Binomial would probably be more realistic here, but it doesn’t really matter. In this case the batch size is 52, as the idea is to have a meeting each week.
1000 simulations resulted in an average of 100 (it would be strange if it wasn’t…) and a standard deviation of 0.81. Does this make sense in theory? The variance of a discrete uniform distribution is $ \sigma^2 = \frac {(b - a + 1)^2 -1} {12}$, where $a, b$ are the min and max values. So in this case, $\sigma ^2 = \frac {(110 - 90 + 1)^2 - 1} {12} \approx 36.67$. The expected std deviation for 52 meetings should then be $ \frac \sigma {\sqrt {52}} \approx 0.84 $. Which is pretty much the same as the simulation generated!
Since 3 standard deviations contain 99.73% of the values (via the 68-95-99.7 rule), it seems that getting enough stuff for 52 * (100 ± 3) people, i.e. enough for between 5044 and 5356 people, should be totally fine. This is a lot better than just getting enough for 52 * (100 ± 10) people, as that would be a range between 4680 and 5720 - a span of 1040 vs 312, i.e. more than 3 times larger! Getting enough for 5300 people seems like a decent enough amount.
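For the party numbers, a quick check (a sketch, numpy assumed) of both the weekly averages and the yearly totals:

```python
import numpy as np

rng = np.random.default_rng(5)

# 1000 simulated years: 52 meetings, attendance uniform on 90..110 (inclusive).
attendance = rng.integers(90, 110, size=(1000, 52), endpoint=True)

weekly_means = attendance.mean(axis=1)
yearly_totals = attendance.sum(axis=1)

print(f"std of weekly averages: {weekly_means.std():.2f}")  # ~0.84
print(f"mean yearly total: {yearly_totals.mean():.0f}")     # ~5200
lo = yearly_totals.mean() - 3 * yearly_totals.std()
hi = yearly_totals.mean() + 3 * yearly_totals.std()
print(f"3-sigma band for the yearly total: {lo:.0f} - {hi:.0f}")
```

The 3-sigma band on the total comes out slightly narrower than 52 * (100 ± 3), since the ± 3 above rounds the per-meeting spread of ~2.5 people up.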
Bird watching
Here it’s assumed that $ \lambda = 3 $, i.e. that there were more or less 3 woodpeckers found in each place. To make this more realistic, let’s assume that you have counts from 10 previous trips, which are [4, 5, 1, 6, 3, 4, 2, 1, 1, 0]. You ask ChatGPT to estimate the $\lambda$ parameter for you, as one does, and it replies that:
The lambda parameter of a Poisson distribution is typically estimated as the mean of the data. To calculate this for the provided data [4, 5, 1, 6, 3, 4, 2, 1, 1, 0], you would add all the numbers together and then divide by the count of the numbers.
Sum of the values: 4 + 5 + 1 + 6 + 3 + 4 + 2 + 1 + 1 + 0 = 27
Count of the values: 10
Mean of the values: 27 / 10 = 2.7
So, the estimated lambda parameter for this data is 2.7.
So far, so good. But 10 checks doesn’t seem like that much. How many times should you go counting for this to be certain? 10 counts gives a standard deviation of 0.54. How many would be needed to get 3 std deviations down to around 1 (meaning that ~99.7% of the time the estimate would be within 1 of the correct lambda)?
The variance of a Poisson distribution is just its $\lambda$ parameter, which is probably going to be around 3. It would be nice to have a sample size large enough that the sample mean will be within ± 1 of the real value of $ \lambda $ ~99.7% of the time (assuming that all woodland patches are and will be identical, including weather, time of day etc. - isn’t this fun!). So what you want is to find $n$ such that $ \sigma = \frac {\sqrt \lambda} {\sqrt n} $, where $ \sigma = \frac 1 3$ (so that 3 std deviations come to 1) and $ \lambda = 3 $ (the variance of a Poisson distribution is $\lambda$, so it has to be square rooted here to get the underlying standard deviation). A few transformations later, you have $ n = \frac {(\sqrt \lambda) ^2} {\sigma ^ 2} = \frac \lambda {\sigma ^ 2} = \frac {3} {(1/3)^2} = 27 $.
Does that check out? 1000 simulations of batches of 27 items from a Poisson distribution with $\lambda = 3$ results in a std deviation of 0.33. Which is exactly what was desired! So you can now go and check another 17 patches to get your missing data, and then finally publish yet another paper!
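The same check in code (a sketch, numpy assumed):

```python
import numpy as np

rng = np.random.default_rng(6)

# How precisely does the average of n trips estimate lambda = 3?
for n in (10, 27):
    estimates = rng.poisson(lam=3, size=(1000, n)).mean(axis=1)
    print(f"{n} trips: std of the estimated lambda = {estimates.std():.2f}")
# ~0.55 for 10 trips, ~0.33 for 27 trips
```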
Lessons learnt
- taking an average of a batch of measurements really works and will home in on the mean of the underlying distribution of the measured thing
- the averages of samples from most distributions have an approximately normal distribution -> probability statements about the averages of batches of measurements can be made with a normal distribution (but not necessarily about the underlying variables!)
- the marginal value of collecting more data quickly diminishes - the error of the average only shrinks as $ \frac 1 {\sqrt n} $. This would explain why so many examples, lessons, experiments etc. use a sample count of 30 - that really will often suffice
- as a rule of thumb, a sample size of 100 is only about twice as good as 30, while 10k is only 10 times better than 100 (see the quick check below)
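The rule of thumb is just the square root at work (a trivial check, numpy assumed):

```python
import numpy as np

# The error of the sample mean shrinks as 1/sqrt(n), so the gain from more data is:
for small, big in ((30, 100), (100, 10_000)):
    print(f"{big} vs {small} samples: ~{np.sqrt(big / small):.1f}x smaller error")
# ~1.8x going from 30 to 100, 10x going from 100 to 10,000
```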