Sunday, August 18, 2013

Block Averaging: Estimating Uncertainty (Part 2)

In the last post, we ended with a conundrum.

To summarize:

When we have a timeseries of independent samples, estimating the mean and the uncertainty of that estimate is straightforward. The central-limit theorem tells us everything we need to know:  \(\sigma_{\langle x \rangle} = \sigma_x/\sqrt{N}\).
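For a quick illustration, here is a minimal Octave sketch of this estimate for independent samples (the data here are made up for illustration, not the data from the previous post):

```octave
% naive error bar for (assumed) independent samples
x    = 5 + randn(10000, 1);      % illustrative data with true mean 5
N    = length(x);
xbar = mean(x);                  % estimate of <x>
err  = std(x) / sqrt(N);         % central-limit-theorem estimate of sigma_<x>
printf("<x> = %6.3f +/- %6.3f\n", xbar, err);
```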

However, when successive data points are correlated, they tell us "less" than uncorrelated data points. If we ignore this diminished information content, and pretend they are as good as uncorrelated samples (by lavishing the central-limit theorem on them), we underestimate the true uncertainty in \(\langle x \rangle\).

In the previous example, we found  \(\langle x \rangle = 4.686 \pm 0.057\).

The problem with this result is not necessarily that the mean is not as close to "5" as we would like, but that the error-bar (the estimated standard deviation of the mean) gives us a false sense of security.

Note that in this example, we knew the true answer. In a molecular simulation, we don't know the true answer, and it is easier to be seduced by small error-bars.

In short, the real problem with treating correlated data as if it were independent is that the resulting analysis underestimates the true error-bar.

Now let us see what we can change. In "the real world" we cannot change the fact that data are going to be correlated, or that the estimated mean is not as close to the "true" mean as we would like, unless we are prepared to carry out extraordinarily long simulations (or extraordinarily many independent simulations). Thus, we set ourselves a fairly modest task.

We want a more reliable estimate of the uncertainty \(\sigma_{\langle x \rangle}\). Here is where a trick called block averaging can be surprisingly helpful. This one slide from David Kofke's lecture notes on Molecular simulation explains the whole idea.

The idea is to chop the data series into \(n\) chunks or "blocks" of size \(b\) (so that \(nb = N\)), and to repeat this for different block sizes. We compute the mean of each block, \(\bar{m}_i, ~ i = 1, 2, \ldots, n\), and then the standard deviation of these block averages, \(\sigma_{\bar{m}_b}\). The subscript \(b\) denotes that this standard deviation is for blocks of size \(b\).

The simulation error is estimated as \(\sigma_{\langle x \rangle} = \sigma_{\bar{m}_b}/\sqrt{n-1}\). Luckily for you, I wrote an Octave script which does this calculation for you.
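A minimal sketch of how such a calculation might look for a single block size is below; blockErr is just a placeholder name, and this is a sketch rather than necessarily the script mentioned above:

```octave
% estimate sigma_<x> by block averaging, for a single block size b
%   x : column vector of (possibly correlated) samples
%   b : block size
function err = blockErr(x, b)
  N   = length(x);
  n   = floor(N / b);               % number of complete blocks
  xb  = reshape(x(1:n*b), b, n);    % each column is one block
  m   = mean(xb, 1);                % block means m_i, i = 1, ..., n
  err = std(m) / sqrt(n - 1);       % sigma_{m_b} / sqrt(n-1)
end
```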

If you plot this estimate against the block size \(b\), it gradually increases and then plateaus out. For the correlated data set from the previous post, we get:

We get a plateau estimate of \(\sigma_{\langle x \rangle} \approx 0.296\), which implies a much more consistent "1 sigma" estimate of \(\langle x \rangle = 4.686 \pm 0.296\).
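The plateau can be located by sweeping over block sizes, for example along the lines of the following sketch (which assumes the blockErr placeholder defined above and a data vector x):

```octave
% sweep over block sizes and look for the plateau in the error estimate
bvals = unique(round(logspace(0, log10(length(x)/10), 30)));  % block sizes from 1 to ~N/10
errs  = arrayfun(@(b) blockErr(x, b), bvals);
semilogx(bvals, errs, 'o-');
xlabel('block size b');
ylabel('estimated \sigma_{<x>}');
```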

As you can see, block averaging reflects the true uncertainty in the data, which is larger than that implied by a mindless application of the central-limit theorem.

What happens if we subject the uncorrelated data from the previous post to the same block-averaging treatment?

See for yourself:

The standard deviation doesn't change with block size.
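In code, this check amounts to something like the following (again using made-up uncorrelated data and the blockErr placeholder from above):

```octave
% block averaging applied to uncorrelated samples: the curve stays flat
xu    = 5 + randn(10000, 1);                      % illustrative uncorrelated data
errsu = arrayfun(@(b) blockErr(xu, b), bvals);
semilogx(bvals, errsu, 'o-');                     % roughly constant with b
```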

6 comments:

milieu said...

Thanks for clarifying the concept of block averaging using the toy example!

Sachin Shanbhag said...

You're welcome.

Anonymous said...

I was so lost with what was going on in some homework I have and you really cleared it up. Now you got me interested in the topic. Thanks!

Sachin Shanbhag said...

Glad to be of service!

gojohnnygo said...

Nice thinking here for a subject with remarkably little literature. Luckily, there is a source that published this very blocking idea with a more thorough mathematical analysis: Flyvbjerg and Petersen (1989), "Error estimates on averages of correlated data" (the blocking method).
link: https://www.researchgate.net/profile/Henrik_Flyvbjerg/publication/258024272_Error_Estimates_on_Averages_of_Correlated_Data/links/0deec52791e24127d8000000/Error-Estimates-on-Averages-of-Correlated-Data.pdf