Wednesday, August 14, 2013

Block Averaging: Estimating Uncertainty (Part 1)

When you periodically sample a property (e.g., thermodynamic attributes such as pressure, or static properties such as the radius of gyration of a polymer molecule) in "equilibrium" molecular dynamics or Monte Carlo simulations, you end up with a time series of that property.

For specificity, let us call this property \(x(t)\), where \(t = 1, 2, ..., N\) are discrete time steps (it is easy to adapt this discussion to cases where time is not discrete). Usually, the sampling interval is shorter than the characteristic relaxation time of the system, which makes successive measurements of \(x\) "correlated" or "non-independent".

The weather or the stock market on successive days is correlated because the timescale of a weather system or a business news item is often a few days; successive tosses of a fair coin or a pair of dice, however, can be expected to be uncorrelated.

Uncorrelated data-points are nice and desirable because they save us a lot of trouble. We can not only find the average value \[\langle x \rangle = \frac{x(1) + x(2) + ... + x(N)}{N},\] of the property easily, but also associate an error bar or standard deviation with it.

Thus, if \(\sigma_x\) is the standard deviation of \(x\), then the central-limit theorem tells us that, for \(N\) independent samples, the standard deviation of the mean is \(\sigma_{\langle x \rangle} = \sigma_x/\sqrt{N}\).

Let us consider a simple example. I generated a time series of N = 1000 samples of a random number with a mean value of "5" using the Octave command x = 5 + 2 * rand(N,1) - 1, which adds a uniform random number between -1 and 1 to the mean.
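In case you want to reproduce this, here is a minimal sketch of the whole calculation (variable names other than x and N are my own):

    N = 1000;
    x = 5 + 2 * rand(N, 1) - 1;   % uniform noise on [-1, 1] around 5
    xbar = mean(x);               % sample mean, should be close to 5
    sigx = std(x);                % should be close to 1/sqrt(3) = 0.577
    sigmean = sigx / sqrt(N);     % standard error, valid for uncorrelated samples
    printf("mean = %.3f +/- %.3f\n", xbar, sigmean);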

I plot the time-series (or vector in Octave parlance), and its autocorrelation function computed via this script, in the following figure.
As expected, \(\sigma_x = 0.575 \approx 1/\sqrt{3}\), and \(\langle x \rangle = 4.983 \pm 0.018\), where the standard deviation of the mean is \(0.575/\sqrt{1000}\). This gives a tight "1 sigma" range for the mean, between 4.965 and 5.001, which contains the true mean value of "5", and everything is hunky-dory.

We also see the autocorrelation function quickly decay from 1 to zero. This is a hallmark of an uncorrelated process. If successive samples were correlated, then the autocorrelation function would take its own sweet time to fall to zero (as we shall see in an example shortly). The amount of time it takes to fall to zero or "decorrelate" is a measure of how correlated the data is.
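The script linked above computes the autocorrelation; a minimal Octave sketch of such a normalized autocorrelation function (the function name and arguments are my own) might look like:

    % Normalized autocorrelation C(k) of a column vector x, for lags 0..maxlag:
    % C(k) = <dx(t) dx(t+k)> / <dx(t)^2>, where dx = x - mean(x)
    function C = autocorr_fn(x, maxlag)
      dx = x - mean(x);
      N = length(dx);
      C = zeros(maxlag + 1, 1);
      for k = 0:maxlag
        C(k+1) = sum(dx(1:N-k) .* dx(1+k:N)) / (N - k);
      end
      C = C / C(1);   % normalize so that C(0) = 1
    end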

Now let me make up a correlated process. In the following example, I set x(1) = 0 and x(t+1) = 0.95 * x(t) + 2 * rand(1) - 1, and then shifted all the values by "5". This is an autoregressive (AR(1)) model, but we don't have to know anything about that here.
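A sketch of the corresponding loop, under the same assumptions as before:

    % Correlated series: AR(1) with coefficient 0.95, driven by
    % uniform noise on [-1, 1], then shifted to fluctuate around 5
    N = 1000;
    x = zeros(N, 1);
    for t = 1:N-1
      x(t+1) = 0.95 * x(t) + 2 * rand(1) - 1;
    end
    x = x + 5;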

The time-series and its autocorrelation function look as follows:

You can "see" the correlation in the data-series visually. The series fluctuates around "5" (not completely obvious) in a qualitatively different manner from the uncorrelated series. This is apparent in the autocorrelation function, which now decays to 0 around \(t = 50\). The red line is the autocorrelation of this particular dataset, and the black line is the average over 100 such datasets.

Here \(\sigma_x = 1.791\), and naively using \(\sigma_{\langle x \rangle} = \sigma_x/\sqrt{N}\), we get \(\langle x \rangle = 4.686 \pm 0.057\). This gives a "1 sigma" range for the mean between 4.629 and 4.743, which does not bracket the true mean value of "5". The "2 sigma" or 95% interval is not much better.
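As a sketch, the naive estimate applies the uncorrelated-sample formula unchanged to the correlated series:

    sigx = std(x);             % ~1.8 for this correlated series
    sigmean = sigx / sqrt(N);  % naive: pretends the samples are independent
    printf("naive mean = %.3f +/- %.3f\n", mean(x), sigmean);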

Clearly something is amiss here. We'll find out more in the next installment.

