Friday, August 24, 2018

Data Visualization Resources

Here are some resources that one of my students (Eitan Lees) shared during a graduate seminar on visualization.
  • The Data Visualisation Catalogue has an exhaustive catalog of different charts and graphical representations. It is a useful first place to browse when figuring out what kind of plot to use.
  • A lot of common "errors", along with practical pointers on improving graphical presentation, are available at this wonderful site.
  • A generally good idea is to "remove to improve".
  • Paletton is a site where you can play with colors and color schemes.

Saturday, August 18, 2018

Mean Squared Displacement

The mean-squared displacement (MSD) of a particle is defined simply as \[\rho(t) = \langle r^2(t) \rangle = \int r^2 c(r) p(r,t) dr,\] where \(p(r,t)\) is the probability density of the displacement, and \(c(r) = 1\), \(2 \pi r\), and \(4 \pi r^2\) in 1, 2, and 3 dimensions, respectively. For diffusion with diffusivity \(D\), this evaluates to \(\langle r^2(t) \rangle = 2dDt\), where \(d\) is the dimension (1, 2, or 3).
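
As a sketch of where the \(2dDt\) result comes from, assume free diffusion starting from the origin (an assumption; the definition above holds for any \(p(r,t)\)). Each Cartesian component \(x_i\) of the displacement is then Gaussian with zero mean and variance \(2Dt\), so \[\langle r^2(t) \rangle = \sum_{i=1}^{d} \langle x_i^2(t) \rangle = d \cdot 2Dt = 2dDt.\]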

One can also compute the variance of the squared displacement, \[\text{var}(\rho) = \langle r^4(t) \rangle - \langle r^2(t) \rangle^2.\] This evaluates to
\begin{align}
1D:& 2\,(2Dt)^2 = 8D^2t^2\\
2D:& \dfrac{2}{2} (4Dt)^2 = 16 D^2 t^2\\
3D:& \dfrac{2}{3} (6Dt)^2 = 24 D^2 t^2
\end{align}
These three cases collapse into a single expression, \[\text{var}(\rho) = \dfrac{2}{d} \rho^2.\]
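
The relation between the variance and the MSD can be checked numerically. The sketch below is only a rough Monte Carlo check: it assumes free diffusion, so that each Cartesian component of the displacement is Gaussian with variance 2Dt. The function name and its parameters are made up for illustration.

```python
import numpy as np

def squared_displacement_stats(d, D=1.0, t=1.0, n=200_000, seed=0):
    """Sample n displacements in d dimensions; return mean and variance of r^2."""
    rng = np.random.default_rng(seed)
    # assumption: free diffusion, each component ~ Normal(0, 2*D*t)
    xyz = rng.normal(0.0, np.sqrt(2.0 * D * t), size=(n, d))
    r2 = np.sum(xyz**2, axis=1)
    return r2.mean(), r2.var()

for d in (1, 2, 3):
    msd, var = squared_displacement_stats(d)
    # the last two columns should agree up to sampling noise
    print(d, msd, var, 2.0 / d * msd**2)
```

With D = t = 1, the MSD should come out close to 2d and the variance close to 8d.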

Wednesday, August 15, 2018

Plotting CDF: Note to Self

Consider the histogram of samples from a normal distribution:

import numpy as np

x = np.random.normal(0., 1., size=10000)
pdf, bins = np.histogram(x, density=True)  # "normed" has been replaced by "density"

The size of the array "bins" is not equal to the size of "pdf". Consecutive elements of "bins" specify the left and right edges of a particular bin. Thus, with the NumPy default of 10 bins, the "bins" array has 11 elements, while "pdf" has 10 elements.
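
A small sketch that makes the size mismatch concrete. It uses density=True, the current NumPy keyword for a normalized histogram, and a seeded generator (my choice, only so that the run is reproducible).

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=10000)
pdf, bins = np.histogram(x, density=True)  # default: 10 equal-width bins

widths = np.diff(bins)           # one width per bin, same length as pdf
print(len(bins), len(pdf))       # 11 10
print(np.sum(pdf * widths))      # total probability, 1 up to round-off
```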

Note that the matplotlib command "hist" is identical in this regard.

Now suppose you want to compare the histogram with the theoretical PDF (Gaussian). Using the histogram, one could construct an equivalent line chart by taking the midpoint of each bin.

import matplotlib.pyplot as plt

# the histogram of the data ("normed" has been replaced by "density")
pdf, bins, patches = plt.hist(x, 30, density=True, facecolor='green', alpha=0.4)
xpdf = (bins[1:]+bins[:-1])/2 # midpoints
plt.plot(xpdf, pdf, 'o-')

# theoretical curve
xi = np.linspace(-4, 4)
gx = 1/np.sqrt(2.*np.pi)*np.exp(-xi**2/2)
plt.plot(xi, gx, 'k--')

Everything looks fine.

Now let's consider the CDF, and plot it against the theoretical CDF. If I use bin midpoints to plot the empirical CDF I get something funky.

from scipy.special import erf
cdf  = np.cumsum(pdf)
cdf  = cdf/cdf[-1]
plt.plot(xpdf, cdf, 'o')

gcdf = 0.5*(1 + erf(xi/np.sqrt(2.)))
plt.plot(xi, gcdf)

There is a visible offset: the empirical points sit about half a bin width to the left of the theoretical curve.

Instead of using bin midpoints, I should use the right limits when plotting the CDF. This makes sense upon a moment's reflection: the cumulative sum up to a given bin is the probability accumulated up to that bin's right edge.

xcdf = bins[1:]
plt.plot(xcdf, cdf, 'o')

gcdf = 0.5*(1 + erf(xi/np.sqrt(2.)))
plt.plot(xi, gcdf)
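
The difference between the two choices can also be quantified without a plot. The following is a rough sketch, not part of the original analysis: it compares the empirical CDF with the Gaussian CDF at the bin midpoints and at the right edges, and reports the largest absolute deviation in each case. The helper gaussian_cdf and the seeded generator are my own additions. Multiplying by the bin widths before the cumulative sum is equivalent to the normalization by cdf[-1] used above when the bins have equal width.

```python
import numpy as np
from scipy.special import erf

def gaussian_cdf(z):
    """CDF of the standard normal distribution."""
    return 0.5 * (1.0 + erf(z / np.sqrt(2.0)))

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=10000)
pdf, bins = np.histogram(x, bins=30, density=True)

# probability accumulated up to the right edge of each bin
cdf = np.cumsum(pdf * np.diff(bins))

mid = (bins[1:] + bins[:-1]) / 2
right = bins[1:]
err_mid = np.max(np.abs(cdf - gaussian_cdf(mid)))
err_right = np.max(np.abs(cdf - gaussian_cdf(right)))
print(err_mid, err_right)
```

The deviation at the right edges should be only sampling noise, while the deviation at the midpoints also includes the half-bin shift.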


Wednesday, August 8, 2018

Links

1. Yogic Capitalism (Bloomberg on Baba Ramdev)

2. Netflix has a decent documentary on him

3. A history of punctuation in English (Ashley Timms)

4. Tim Urban has some career advice