## Tuesday, March 29, 2011

### Polygonal kitchen sinks

I heard about this study in Nature (subscription required) in a recent talk. See some pictures here.

When a water jet strikes a flat surface at high Reynolds number, it normally creates a circular hydraulic jump. In the paper linked above, Ellegaard et al. demonstrate that stationary polygonal patterns can form (instead of circles) when high-viscosity fluids are used.

Fascinating.

## Thursday, March 24, 2011

1. Radiation Dose Chart: xkcd presents an illuminating graphic. We often forget that we are immersed in radiation.

2. Gallon to Gallon: Stuff that is more expensive than gasoline. It reminded me of an observation a friend made when he first came to the US from India: "It's funny how the price of water is sometimes more than milk, which itself is sometimes more than gasoline, and peanuts and cashews cost about the same!" (via FlowingData). There is also an interesting chart at FlowingData showing the geographic distribution of gasoline prices.

3. Spectral Mesh Compression (pdf lecture): Interesting things happen when you transform them (via Mathematical Poetics).

## Friday, March 18, 2011

### Finance Links: Skill versus Luck

Two interesting articles:

1. Luck versus skill: How can you tell? Aswath Damodaran on why it is much harder to separate the two in finance than in other fields like sports.

2. Untangling skill and luck (pdf). The introduction to this piece by Michael Mauboussin is by itself enough to make you want to read on:

> For almost two centuries, Spain has hosted an enormously popular Christmas lottery. Based on payout, it is the biggest lottery in the world and nearly all Spaniards play. In the mid 1970s, a man sought a ticket with the last two digits ending in 48. He found a ticket, bought it, and then won the lottery. When asked why he was so intent on finding that number, he replied, "I dreamed of the number seven for seven straight nights. And 7 times 7 is 48."

## Monday, March 14, 2011

### How to place a tight "bounding box" around an EPS image?

Recently, I received an EPS file with extra white space around the actual figure. There are a number of ways of trimming the extra white space, including opening up the EPS file in a text editor and manually redefining the bounding box (this is simpler than it sounds, especially once you've done it a couple of times).

However, if you run Linux and have Ghostscript installed (it is, by default, on most distributions), you can use the eps2eps wrapper:

eps2eps FileWithWhiteSpace.eps FileTrimmedWhiteSpace.eps

Simple.
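If eps2eps is unavailable, the manual route above can be scripted as well. Here's a minimal sketch (file names and coordinates are made up; in practice you would read the tight coordinates off Ghostscript's bbox device, e.g. gs -sDEVICE=bbox -dBATCH -dNOPAUSE file.eps, and paste them in):

```shell
# Create a toy EPS whose declared bounding box is a full letter page.
printf '%%!PS-Adobe-3.0 EPSF-3.0\n%%%%BoundingBox: 0 0 612 792\n' > loose.eps

# Rewrite the %%BoundingBox comment with tighter (hypothetical) coordinates.
sed 's/^%%BoundingBox:.*/%%BoundingBox: 50 100 400 300/' loose.eps > tight.eps

grep '^%%BoundingBox' tight.eps   # -> %%BoundingBox: 50 100 400 300
```

This is exactly what "opening the file in a text editor" does, minus the editor.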

## Thursday, March 10, 2011

### Domain-based overconfidence

I was reading a financial blog that I peruse occasionally. A recent post contained two "puzzles". Usually, that is enough to keep me from skimming past.
1. The Dow Jones Price Index does not include the effects of dividend re-investment. If dividends had been considered re-invested in the index since its inception in 1896, at what price level would the index be at today? Provide a 90% confidence interval around your answer (i.e. you are 90% confident that your interval includes the right answer).
2. There are 100 bags, each containing 1000 poker chips. 45 bags have 700 black chips and 300 red chips, while 55 bags have 700 red chips and 300 black chips. If you select a bag, what is the probability that most of the chips are black? If you pulled out 12 chips from that bag, and 8 of them are black and 4 of them are red, now what is the probability that most of the chips in the bag are black?
I have seen questions of this type before (a short rant later), so I was not caught off-guard. (Sidenote: puzzles, like jokes, aren't quite the same the second time you hear them.) For the first question I answered 1 trillion (the actual answer is about 650,000; the DJIA is currently around 12,000), and for the second I said 0.45 and 0.96 (thanks to Bayes' theorem), which are the correct answers.
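For the record, the Bayes update behind the 0.96 can be checked in a few lines of awk, using the numbers from the puzzle (and treating the 12 draws as independent, which is a good approximation for a bag of 1000 chips):

```shell
# Posterior probability that the bag is black-majority,
# given 8 black and 4 red chips drawn from it.
awk 'BEGIN {
  prior = 0.45                  # 45 of the 100 bags are black-majority
  lb = 0.7^8 * 0.3^4            # P(8 black, 4 red | black-majority bag)
  lr = 0.3^8 * 0.7^4            # P(8 black, 4 red | red-majority bag)
  printf "%.2f\n", prior*lb / (prior*lb + (1-prior)*lr)   # -> 0.96
}'
```

The binomial coefficient is the same in both likelihoods, so it cancels out of the ratio.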

Typically, people guess far below 650,000 for Q1; for the second part of Q2, they usually guess between 45% and 75%.
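A back-of-the-envelope way to see why 650,000 is plausible: spread over 115 years, the gap between 12,000 and 650,000 corresponds to only a few percent per year of reinvested dividends. A quick awk sketch (assuming, per my recollection, that the index started near 41 in 1896):

```shell
# Annualized growth rates implied by the two endpoints.
awk 'BEGIN {
  years = 115                              # 1896 to 2011
  price = (12000/41)^(1/years)  - 1        # price-only annual growth
  total = (650000/41)^(1/years) - 1        # annual growth with dividends reinvested
  printf "%.1f%% vs %.1f%%\n", 100*price, 100*total   # -> 5.1% vs 8.8%
}'
```

A gap of about 3.7% per year is a perfectly ordinary dividend yield, but compounded over a century it produces a factor of more than 50.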

To be perfectly honest, for Q1, I had two numbers in mind. I thought of 1 trillion as the "correct answer" (based on having seen the type of questions before), and a much lower (and wrong) 100,000 as a plausible tight upper-bound. So in some sense I flunked Q1.

Here's my rationalization.

The disturbing part is the "90% confidence", which implies that if one were asked such questions 10 times, one should flunk once on average. The problem with gross overestimates (like 1 trillion) is that the odds of getting that occasional (required) wrong answer diminish.

## Monday, March 7, 2011

### CiteULike: Export and clean BiBTeX file

As mentioned in a previous post, CiteULike lets you export citations in BibTeX format, which is useful for including them in LaTeX documents. However, the BibTeX entries it produces contain a lot of metadata that I like to filter out.

One could do this manually, which is fine, but many of the cleaning operations are routine. As any self-respecting Linux user would say, it would be nice to automate at least part of the clean-up process.

For example, a typical BibTeX entry that CiteULike produces looks like:

@article{newman01,
abstract = {{We describe in detail an efficient algorithm for studying site or bond percolation on any lattice. The algorithm can measure an observable quantity in a percolation system for all values of the site or bond occupation probability from zero to one in an amount of time that scales linearly with the size of the system. We demonstrate our algorithm by using it to investigate a number of issues in percolation theory, including the position of the percolation transition for site percolation on the square lattice, the stretched exponential behavior of spanning probabilities away from the critical point, and the size of the giant component for site percolation on random graphs.}},
archivePrefix = {arXiv},
author = {Newman, M. E. J. and Ziff, R. M.},
citeulike-article-id = {3373958},
day = {8},
doi = {10.1103/PhysRevE.64.016706},
eprint = {cond-mat/0101295},
journal = {Physical Review E},
keywords = {cluster, fast, percolation},
month = {Jun},
number = {1},
pages = {016706+},
posted-at = {2011-01-12 21:22:08},
priority = {2},
publisher = {American Physical Society},
title = {{Fast Monte Carlo algorithm for site or bond percolation}},
url = {http://dx.doi.org/10.1103/PhysRevE.64.016706},
volume = {64},
year = {2001}
}

Typically, I like to get rid of all the "citeulike" tags and irrelevant metadata such as priority, abstract, etc. Additionally, I like to abbreviate journal titles and replace author first names with initials where necessary.

That is quite a bit.

I wrote a quick-and-dirty sed script, called clean_citeulike, as below:

s/Physical Review/Phys. Rev./
s/Journal of Rheology/J. Rheol./
s/Rheologica Acta/Rheol. Acta/
s/The Journal of Chemical Physics/J. Chem. Phys./
s/Computer Physics Communications/Comp. Phys. Comm./
s/Macromolecular Theory and Simulations/Macromol. Theory Simul./

/author/ s/, \([A-Z]\)[a-z]* /, \1. /g
/author/ s/, \([A-Z]\)[a-z]*}/, \1.}/g

/citeulike/ d
/keywords/ d
/posted-at/ d
/priority/ d
/publisher/ d
/abstract/ d
/month/ d
/url/ d
/day/ d
/issn/ d

Now, all I need to do is run the script on the "bib" file (which I call in.bib) according to

$ sed -f clean_citeulike in.bib

and I get something that looks like:

@article{newman01,
archivePrefix = {arXiv},
author = {Newman, M. E. J. and Ziff, R. M.},
doi = {10.1103/PhysRevE.64.016706},
eprint = {cond-mat/0101295},
journal = {Phys. Rev. E},
number = {1},
pages = {016706+},
title = {{Fast Monte Carlo algorithm for site or bond percolation}},
volume = {64},
year = {2001}
}

Not perfect perhaps, but much simpler to clean manually.
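For a quick self-contained sanity check, here is the same idea run end-to-end on a toy entry (the script is abridged to the rules it exercises, and both file names and the entry contents are made up):

```shell
# An abridged version of the clean-up script: one journal
# abbreviation plus two of the deletion rules.
cat > clean_mini <<'EOF'
s/Physical Review/Phys. Rev./
/citeulike/ d
/abstract/ d
EOF

# A stripped-down CiteULike-style entry to run it on.
cat > in.bib <<'EOF'
@article{newman01,
abstract = {{...}},
citeulike-article-id = {3373958},
journal = {Physical Review E},
year = {2001}
}
EOF

sed -f clean_mini in.bib
```

The two noise lines are dropped and the journal title comes out abbreviated, leaving a four-line entry.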

## Friday, March 4, 2011

### The typical person

If you haven't seen this yet, this National Geographic video is great (thanks FlowingData).

## Wednesday, March 2, 2011

### Undergrad Summer Internships: Email Spam?

It is funny that I have received about half a dozen generic internship emails since a recent post, "Indian undergrads' internship request e-mails", appeared on nanopolitan. If you read the discussion around the topic (comments and links), it is clear that there are two categories of potential interns: the serious and the not-so-serious.

Let's dismiss the not-so-serious, and consider only the serious. At one point in time, I was one of them.

At that time, I was strongly conflicted between going to grad school or industry, with a bias towards the latter. Email was not as ubiquitous then (I think we had a quota of 300 extramural emails per semester, or something like that). Luckily, I spent an extremely enjoyable summer at HLRC, working on a computational project, after my sophomore year. I spent the following summer at an aromatics plant in Taloja. By the end of the second internship, it was clear to me that I wanted to go to grad school.

Clearly, I am indebted to those internships.

But now, my role has switched, and I look at the process from an institutional standpoint. What did the institutions that gave me internships gain? My guess is that it let them "check me out" for possible future employment. In many ways, this is an efficient way to hire. The actual work I did was probably lost shortly after I left.

From the standpoint of both the student and the institution, the actual work accomplished over the internship is less important than the effect of the experience on future career and employment choices.

With that, let me get back to the "internship emails" I typically receive from IITians who want to work for about "8-12 weeks". The only motivation for a professor (in the US) that I can conjure up is the chance to recruit a good student for a subsequent MS or PhD. Anecdote is not data (and personal anecdote, less so), but I haven't seen a single internship prospect of this kind materialize into a grad student. Surely counter-examples can be supplied, but if it were a great hiring tool, I suspect it would naturally have been used more extensively (as it still is in industry).