Monday, August 6, 2012

Don't use gnuplot "fit" blindly!

Gnuplot is a great plotting/data-analysis program. It has a fairly robust tool called "fit" which can be used to perform nonlinear least-squares regressions very conveniently.

It does what it advertizes, and infact, does it quite well.

But let us consider a simple cautionary example.

Consider a linear model $Y = P_1 + P_2 X$, with $P_1$ = 2, and $P_2$ = 1.5. Let us generate data using this model, and add some white noise to it. In particular we assume that $Y_e = N(Y,\sigma=0.5)$ is the experimental data which is normally distributed about the linear model with a standard deviation of 0.5.

 The red line is the model, the error-bars denote the standard deviation of the Gaussian noise, and the circles represent a particular realization of experimental data.
We can generate experimental data using the prescribed model and noise profile. This is shown in the figure above.

If we use gnuplot "fit" to fit through the data-points above we get:

Final set of parameters Asymptotic Standard Error
======================= ==========================
P1 = 1.98866 +/- 0.1156 (5.812%)
P2 = 1.4667 +/- 0.03817 (2.603%)

This looks quite good. Even the parameter estimates look quite sharp.

In a way, this is expected because "fit" assumes that the error in $Y$ is distributed normally (which in this case it is!).

Now let us convert this idealized problem, which we grok completely, into something slightly more complicated.

Let us apply a nonlinear transformation on all the participating variables. Thus, we define $x = \exp(X)$ and $y=\exp(Y)$. A replotting of this transformed problem looks like:
This picture is the same as the first figure with the exponential transformation applied. The location of the circles and the red line is the same, except that they are transformed. The location of the bottom and top tips of the error bars are also the same, except that they too are transformed. In short, if I plotted this figure on a log-log plot, it would "look" identical to the first figure.

However notice two important differences. One, the size of the error bars is not the same. It covaries with the magnitude of $y$. Two, the error bars are not even symmetrical. In fact, it can be shown that if the variable $Y \sim N(\mu,\sigma)$, then $y = \exp(Y) \sim \text{lgN}(\mu,\sigma)$, where $\text{lgN}$ refers to the log-normal distribution.

Let us throw this non-linear problem at gnuplot. Clearly, we don't expect things to go as smoothly as before, because the distribution of the error violates the Gaussian assumption in gnuplot's "fit".

Final set of parameters Asymptotic Standard Error

======================= ==========================
p1 = 4.68963 +/- 0.7653 (16.32%)
p2 = 1.63074 +/- 0.03327 (2.04%)

Note that not only are the parameter values wrong ($p_1 = \exp(P_1)$ should be 7.39 and not 4.69, and $p_2 = P_2$ should be 1.5 not 1.63), but the error bars are quite meaningless.

It is particularly important to realize that this is *bad* even though the "fit" visually looks better than the actual model!

But lift the veil by replotting the figure above on a log-log scale, and you can guess what exactly is going on!

1 comment:

Aneesh Chand said...

Hi,
Can you give the syntax for generating the data points from the linear model with noise in it?

Thanks.