Wednesday, August 15, 2018

Plotting CDF: Note to Self

Consider the histogram of samples from a normal distribution:

x = np.random.normal(0., 1., size=10000)
pdf, bins = np.histogram(x, normed=True)

The size of the array "bins" is not equal to the size of "pdf". Consecutive elements of "bins" specify the left and right edges of a particular bin. Thus, by default in python, "bins" array has 11 elements, while "pdf" has 10 elements.

Note that the matplotlib command "hist" is identical in this regard.

Now suppose you want to compare the histogram with the theoretical PDF (Gaussian). Using the histogram, one could construct an equivalent line chart by taking the mid point of each bin.

# the histogram of the data
pdf, bins, patches = plt.hist(x, 30, normed=1, facecolor='green', alpha=0.4)
xpdf = (bins[1:]+bins[:-1])/2 # midpoints
plt.plot(xpdf, pdf, 'o-')

# theoretical curve
xi = np.linspace(-4, 4)
gx = 1/np.sqrt(2.*np.pi)*np.exp(-xi**2/2)
plt.plot(xi, gx, 'k--')

Everything looks fine.

Now let's consider the CDF, and plot it against the theoretical CDF. If I use bin midpoints to plot the empirical CDF I get something funky.

from scipy.special import erf
cdf  = np.cumsum(pdf)
cdf  = cdf/cdf[-1]
plt.plot(xpdf, cdf, 'o')

gcdf = 0.5*(1 + erf(xi/np.sqrt(2.)))
plt.plot(xi, gcdf)

There is a visible offset.

Instead of using bin midpoints, I should use the right limits when plotting the CDF (this makes sense upon a moments reflection!).

xcdf = bins[1:]
plt.plot(xcdf, cdf, 'o')

gcdf = 0.5*(1 + erf(xi/np.sqrt(2.)))
plt.plot(xi, gcdf)


No comments: