I’ve seen this question come up a couple of times, most recently on the python-dev mailing list. When you want to benchmark something, you naturally want to run the workload multiple times. But what is the best way to aggregate the multiple measurements? The two common ways are to take the minimum and to take the average (though there are many more, such as “drop the highest and lowest and return the average of the rest”). The arguments I’ve seen for minimum/average are:
- The minimum is better because it better reflects the underlying model of benchmark results: that there is some ideal “best case”, which can be hampered by various slowdowns. Taking the minimum will give you a better estimate of the true behavior of the program.
- Taking the average provides better aggregation because it “uses all of the samples”.
These are both pretty abstract arguments — even if you agree with the logic, why does either argument mean that that approach is better?
I’m going to take a different approach to try to make this question a bit more rigorous, and show that in different cases different metrics are better.
The first thing to do is to figure out how to formally compare two aggregation methods. I’m going to do this by saying the statistic which has lower variance is better. And by variance I mean variance of the aggregation statistic as the entire benchmarking process is run multiple times. When we benchmark two different algorithms, which statistic should we use so that the comparison has the lowest amount of random noise?
Quick note on the formalization — there may be a better way to do this. This particular way has the unfortunate result that “always return 0” is an unbeatable aggregation. It also slightly penalizes the average, since the average will be larger than the minimum so might be expected to have larger variance. But I think as long as we are not trying to game the scoring metric, it ends up working pretty well. This metric also has the nice property that it only focuses on the variance of the underlying distribution, not the mean, which reduces the number of benchmark distributions we have to consider.
The variance of the minimum/average is hard to calculate analytically (especially for the minimum), so we’re going to make it easy on ourselves and just do a Monte Carlo simulation. There are two big parameters to this simulation: our assumed model of benchmark results, and the number of times we sample from it (aka the number of benchmark runs we do). As we’ll see the results vary pretty dramatically on those two dimensions.
The first distribution to try is probably the most reasonable-sounding: we assume that the results are normally-distributed. For simplicity I’m using a normal distribution with mean 0 and standard deviation 1. Not entirely reasonable for benchmark results to have negative numbers, but as I mentioned, we are only interested in the variance and not the mean.
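As a rough illustration of this setup, here is a minimal Python sketch of such a Monte Carlo simulation, using only the standard library. The names `stddev_of_min_and_avg` and `draw_normal`, the trial count, and the seed are illustrative assumptions, not the code that produced the numbers in this post, so the exact digits will differ slightly.

```python
import random
import statistics


def stddev_of_min_and_avg(draw, runs, trials=20000, seed=0):
    """Estimate how much min and average vary across benchmarking sessions.

    draw(rng) returns one simulated benchmark result; `runs` is the number
    of results aggregated per session; `trials` is the number of simulated
    sessions (an arbitrary choice for this sketch).
    """
    rng = random.Random(seed)
    mins, avgs = [], []
    for _ in range(trials):
        samples = [draw(rng) for _ in range(runs)]
        mins.append(min(samples))
        avgs.append(statistics.fmean(samples))
    # Standard deviation of each aggregate over all simulated sessions.
    return statistics.pstdev(mins), statistics.pstdev(avgs)


def draw_normal(rng):
    # Assumed model: mean 0, standard deviation 1.
    return rng.gauss(0.0, 1.0)


for runs in (1, 3, 10):
    sd_min, sd_avg = stddev_of_min_and_avg(draw_normal, runs)
    print(f"runs={runs}: stddev of min: {sd_min:.3f}  stddev of avg: {sd_avg:.3f}")
```

Swapping in a different `draw` function is all it takes to try another model of benchmark results.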
If we say that we sample one time (run the benchmark only once), the results are:
stddev of min: 1.005 stddev of avg: 1.005
Ok good, our testing setup is working. If you only have one sample, the two statistics are the same.
If we sample three times, the results are:
stddev of min: 0.75 stddev of avg: 0.58
And for 10 times:
stddev of min: 0.59 stddev of avg: 0.32
So the average pretty clearly is a better statistic for the normal distribution. Maybe there is something to the claim that the average is just a better statistic?
Let’s try another distribution, the log-normal distribution. This is the distribution of a variable whose logarithm is normally distributed with, in this case, a mean of 0 and standard deviation of 1. Taking 3 samples from this, we get:
stddev of min: 0.45 stddev of avg: 1.25
The minimum is much better. But for fun we can also look at the max: it has a standard deviation of 3.05, which is much worse. Clearly the asymmetry of the lognormal distribution has a large effect on the answer here. I can’t think of a reasonable explanation for why benchmark results might be log-normally-distributed, but as a proxy for other right-skewed distributions this gives some pretty compelling results.
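The log-normal comparison, including the max, can be sketched along the same lines. This is a hedged, standard-library-only illustration: the function name `lognormal_min_avg_max`, the trial count, and the seed are assumptions, and because the log-normal has a heavy right tail the estimate for the max in particular is noisy from seed to seed.

```python
import random
import statistics


def lognormal_min_avg_max(runs, trials=20000, seed=0):
    """Stddev of min, average, and max of `runs` log-normal samples.

    The underlying normal has mean 0 and standard deviation 1, matching
    the parameters described in the text.
    """
    rng = random.Random(seed)
    mins, avgs, maxes = [], [], []
    for _ in range(trials):
        samples = [rng.lognormvariate(0.0, 1.0) for _ in range(runs)]
        mins.append(min(samples))
        avgs.append(statistics.fmean(samples))
        maxes.append(max(samples))
    return (
        statistics.pstdev(mins),
        statistics.pstdev(avgs),
        statistics.pstdev(maxes),
    )


sd_min, sd_avg, sd_max = lognormal_min_avg_max(runs=3)
print(f"stddev of min: {sd_min:.2f}  avg: {sd_avg:.2f}  max: {sd_max:.2f}")
```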
Update: I missed this the first time, but the minimum in these experiments is significantly smaller than the average, which I think might make these results a bit hard to interpret. But then again I still can’t think of a model that would produce a lognormal distribution so I guess it’s more of a thought-provoker anyway.
Next, the binomial distribution, or the “random bad things might happen” distribution. This is the distribution that says “We will encounter N events. Each time we encounter one, with probability p it will slow down our program by 1/(Np)”. (The choice of 1/(Np) is to keep the mean constant as we vary N and p, and was probably unnecessary.)
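To make this model concrete, here is a small sketch of one simulated run under it, plus a check that the 1/(Np) penalty keeps the mean at about 1 regardless of N and p. The function name `simulate_slowdown` and the example parameter pairs are hypothetical choices for illustration only.

```python
import random
import statistics


def simulate_slowdown(rng, n_events, p):
    """Total slowdown of one run under the 'random bad things' model.

    Each of n_events independently goes wrong with probability p, and each
    bad event adds a penalty of 1 / (n_events * p).
    """
    penalty = 1.0 / (n_events * p)
    bad_events = sum(1 for _ in range(n_events) if rng.random() < p)
    return bad_events * penalty


rng = random.Random(0)
# Example parameter pairs, chosen arbitrarily for this sketch.
for n_events, p in [(3, 0.1), (50, 0.3)]:
    results = [simulate_slowdown(rng, n_events, p) for _ in range(20000)]
    mean = statistics.fmean(results)
    print(f"N={n_events}, p={p}: mean slowdown is about {mean:.2f}")
```

The expected number of bad events is Np, so multiplying by a penalty of 1/(Np) gives an expected total slowdown of 1 for any choice of N and p.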
Let’s model some rare-and-very-bad event, like your hourly cron jobs running during one benchmark run, or your computer suddenly going into swap. Let’s say N=3 and p=.1. If we sample three times:
stddev of min: 0.48 stddev of avg: 0.99
Sampling 10 times:
stddev of min: 0.0 stddev of avg: 0.55
So the minimum does better. This seems to match with the argument people make for the minimum, that for this sort of distribution the minimum does a better job of “figuring out” what the underlying performance is like. I think this makes a lot of sense: if you accidentally put your computer to sleep during a benchmark, and wake it up the next day at which point the benchmark finishes, you wouldn’t say that you have to include that sample in the average. One can debate about whether that is proper, but the numbers clearly say that if a very rare event happens then you get less resulting variance if you ignore it.
But many of the things that affect performance occur on a much more frequent basis. One would expect that a single benchmark run encounters many “unfortunate” cache events during its run. Let’s try N=1000 and p=.1. Sampling 3 times:
stddev of min: 0.069 stddev of avg: 0.055
Sampling 10 times:
stddev of min: 0.054 stddev of avg: 0.030
Under this model, the average starts doing better again! The casual explanation is that with this many events, all runs will encounter some unfortunate ones, and the minimum can’t pierce through that. A slightly more formal explanation is that a binomial distribution with large N looks very much like a normal distribution.
There is a statistic of distributions that can help us understand this: skewness. This has a casual understanding that is close to the normal usage of the word, but also a formal numerical definition, which is scale-invariant and just based on the shape of the distribution. The higher the skewness, the more right-skewed the distribution. And, IIUC, we should be able to compare the skewness across the different distributions that I’ve picked out.
The skewness of the normal distribution is 0. The skewness of this particular log-normal distribution is 6.2 (and the poor-performing “max” statistic is the same as taking the min on a distribution with skewness -6.2). The skewness of the first binomial distribution (N=3, p=.1) is 1.54; the skewness of the second (N=1000, p=.1) is 0.08.
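These skewness values can be checked against the standard closed-form expressions: for a log-normal with underlying standard deviation σ the skewness is (exp(σ²) + 2)·sqrt(exp(σ²) − 1), and for a binomial it is (1 − 2p)/sqrt(Np(1 − p)). The helper names below are illustrative. Scaling each event by the positive constant 1/(Np) does not change the skewness, so the plain binomial formula applies to the model used here.

```python
import math


def lognormal_skewness(sigma):
    """Skewness of a log-normal whose underlying normal has stddev sigma."""
    v = math.exp(sigma ** 2)
    return (v + 2.0) * math.sqrt(v - 1.0)


def binomial_skewness(n, p):
    """Skewness of a binomial distribution with n trials and probability p."""
    return (1.0 - 2.0 * p) / math.sqrt(n * p * (1.0 - p))


print(f"log-normal (sigma=1):     {lognormal_skewness(1.0):.1f}")
print(f"binomial (N=3, p=.1):     {binomial_skewness(3, 0.1):.2f}")
print(f"binomial (N=1000, p=.1):  {binomial_skewness(1000, 0.1):.2f}")
```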
I don’t have any formal argument for it, but on these examples at least, the larger the skew (more right-skewed), the better the minimum does.
So which is “better”, taking the minimum or average? For any particular underlying distribution we can empirically say that one or the other is better, but there are different reasonable distributions for which different statistics end up being better. So for better or worse, the choice of which one is better comes down to what we think the underlying distribution will be like. It seems like it might come down to the amount of skew we expect.
Personally, I understand benchmark results to be fairly right-skewed: you will frequently see benchmark results that are much slower than normal (several standard deviations out), but you will never see any that are much faster than normal. When I see those happen, if I am taking a running average I will get annoyed since I feel like the results are then “messed up” (something that these numbers now give some formality to). So personally I use the minimum when I benchmark. But the Central Limit Theorem is strong: if the underlying behavior repeats many times, it will drive the distribution towards a normal one at which point the average becomes better. I think the next step would be to run some actual benchmark numbers a few hundred/thousand times and analyze the resulting distribution.
While this investigation was a bit less conclusive than I hoped, at least now we can move on from abstract arguments about why one metric appeals to us or not: there are cases where each one is definitively better.
One thing I didn’t really write about is that this analysis all assumes that, when comparing two benchmark runs, the mean shifts but the distribution does not. If we are changing the distribution as well, the question becomes more complicated — the minimum statistic will reward changes that make performance more variable.