Saturday, January 29, 2011

Review: The Black Swan - the impact of the highly improbable

Having read some great reviews that this book has attracted,  it was with much delight and anticipation that I finally borrowed this this best-seller by Nicholas Nassim Taleb.

Instead of saving the summary for the end, I am going to spit it out right at the beginning.

I am glad I picked this book up from the library.

It is the kind of book that you could read once. In fact, a good summary could probably distill the essence of the book into two pages, without sacrificing much.

Once upon a time people thought all swans were white. Everything was hunky dory, until black swans were discovered in Australia. The book derives its title from this "highly improbable event". I don't know how many tectonic plates shifted after this discovery, but Taleb does make an important point.

Unfortunately, the point has been made before, and it has been made before even more eloquently. For example Einstein is alleged to have said, "No amount of experimentation can ever prove me right; a single experiment can prove me wrong."

That is really my primary grouse with the book. Most of it has been said before more eloquently.

But the book is not without its merits.

Taleb lays down the three characteristics of a Black Swan: rarity, extreme impact, and retrospective predictability.

Events like 9/11, the rise of the internet, the market crash of 1987, Harry Potter - they are all Black Swans.

By rarity he means power-law rarity, not a six-sigma type of rare event. From the linked wikipedia article
The distribution of a wide variety of natural and man-made phenomena follow a power law, including frequencies of words in most languages, frequencies of family names, sizes of craters on the moon and of solar flares, the sizes of power outages, earthquakes, and wars, the popularity of books and music, and many other quantities.
Power law distributions are scale invariant - they don't have a mean or variance, like most commonly studied distributions do. They aren't widely used in economics, and perhaps they should be, but again Taleb is not the first person to make this point. Mandelbrot did it much earlier and much better.

Extreme impact is easy to understand. Unlike height, variables that obey power law distributions can wreak havoc. If Bill Gates joins a group the average height (a normally distributed variable) does not change appreciably, but the average income goes berserk. Taleb calls the former Mediocristan, and the latter Extremistan which may be useful labels for some people. To me they were more unnecessary pompous verbiage (like Ludic fallacy, scalable, nonlinearity), that the author loves to throw around at the slightest provocation.

Narrative fallacy or retrospective predictability is the tendency of experts and media to explain unpredicted events after the fact, as if they were inevitable. I have to confess that this was the only chapter in the book that I found interesting, because I had not heard it discussed so fully before.

Where the book gets boring is when the author goes off on tangents.

Usually I like tangents. I love footnotes. But when tangents take place with reckless abandon, it results in a chaotic laser show.

But that is not all. The book gets viscerally painful, when he author stoops to cheap shots. To be fair, he may have been hurt by those who he seeks to hurt, but true generosity is the mark of a great man. For example on page 90, he rants:
Where it gets painful is when you see one of your peers, whom you despise, heading to Stockholm for his Nobel reception.
If this was an isolated sentence, it could be easily forgiven. After all, we are all works in progress. But the book is soaked in that kind of "small-think". Another annoying habit is the  castigation of the bell curve ("that great intellectual fraud"), ad nauseam. The bell curve is often inappropriately applied - we get it! But hearing Taleb go on and on, I almost felt like taking that bell curve and banging it on his head so that he would stop.

It is surprising that I found the book utterly disappointing. I actually share quite a bit of Taleb's worldview. I agree that most economists have bogus predictive abilities, that not listening to popular media is a recipe for knowledge, and that it is better to be approximately correct than precisely wrong.

As I said: I am glad I picked this book up from the library.

Saturday, January 22, 2011

Numerical Software: John Burkardt

One of my colleagues, John Burkardt, has a truly amazing catalogue of numerical routines in several different languages (C, C++, Fortran77, Fortran90, Matlab etc.). You can tell that he has nurtured the subroutines with a collector's zeal.
Over the years, I have collected, modified, adapted, adopted or created a number of software packages in FORTRAN. You might be able to use one of these libraries, or a routine or two from a library.
His collection is especially impressive, because the subroutines, algorithms, sources and drivers, are extremely well-documented. Essentially, if you need something in a certain language, and you find it in John's collection, then incorporating his subroutines is child's play.

The best way to find if a certain routine is available is to google it, together with his name.

Thursday, January 20, 2011

Links: A couple of questions

Personally relevant questions:

1. How to write good code (xkcd)?.
2. Why am I always stuck in the slowest lane (FlowingData)?

Saturday, January 15, 2011

The truth wears off?

Here is an interesting article from The New Yorker. An excerpt from the article (emphasis mine):
Leigh Simmons, a biologist at the University of Western Australia, suggested one explanation when he told me about his initial enthusiasm for the theory: “I was really excited by fluctuating asymmetry. The early studies made the effect look very robust.” He decided to conduct a few experiments of his own, investigating symmetry in male horned beetles. “Unfortunately, I couldn’t find the effect,” he said. “But the worst part was that when I submitted these null results I had difficulty getting them published. The journals only wanted confirming data. It was too exciting an idea to disprove, at least back then.” For Simmons, the steep rise and slow fall of fluctuating asymmetry is a clear example of a scientific paradigm, one of those intellectual fads that both guide and constrain research: after a new paradigm is proposed, the peer-review process is tilted toward positive results. But then, after a few years, the academic incentives shift—the paradigm has become entrenched—so that the most notable results are now those that disprove the theory.

Thursday, January 13, 2011

Histogram Puzzle: A faster solution?

Here is the problem statement. The brute-force solution is O(N^2).

Here I outline a faster O(N log N) solution. There may be an even better way to do it, but usually, when I get to O(N log N), I am happy, and move on to other things.

Given a histogram, the first thing we do is locate any columns that have zero height. Such columns punctuate the overall histogram, and can be used to break a big problem, into a series of smaller ones (smaller 'N').

We then work with a (piece of the) histogram that has no zero height columns, such as the one illustrated before (and below), for concreteness.

Also, for clarity, I will perform some additional steps, which need not be explicitly implemented.
As before, we initialize,
maxRect = max{h(i)}     # max.Rect.area() = 9, here

Next, we quicksort the column heights. This is an O(N log N) operation. I also remember the position of a given column in the original histogram. In the illustration above, I get a sorted array which looks something like:

1  7
2  1
2  5
3  6
3  2
3  3
4  8
9  4

where the first column represents h(i), and the second column denotes the location in the original histogram.

We look at the first row with height 1, and here's the key insight.
The biggest rectangle that can be constructed in which this column (h=1) participates is 1 * N = 8. Since this is smaller than MaxRect.area() (=9), this column is not part of the solution.

So we throw it away.

But we throw it away in a unique fashion.

We delete the magnitude of this column from every other column. This is what happens to the remaining histogram.
We have now introduced additional zero height columns. We use this to break our problem further, and examine the pieces punctuated by the column that just disappeared.

One of those pieces (column N has height 4) is useless, and we throw it away, and focus on the other piece.

Rinse, dry, repeat.

The rest of the strategy should be reasonably clear, from this point on.

During each step, we decide whether a (many) column(s) can be part of the biggest rectangle. This is O(N). In practice, this may be even faster (?), because we break the big problem into smaller problems. The "N" is not the same.

Friday, January 7, 2011

Histogram Puzzle: Brute Force

I am going to outline the brute force method of solving the histogram puzzle. This is not a terribly smart way to solve the problem, but at least it forces us to parse the problem.

As I will comment later, it is not completely without merits, either.

Let "i" and "j" be two indices which run from 1 to N. Given a pair (i, j), a unique rectangle (with the maximum possible area, given the indices) that is inscribed within the histogram can be identified.

In the figure above, if i = 2, and j = 3, this rectangle has dimensions 2*4. Similarly if i = 3, and j = 6, this rectangle has dimensions of 4*2. Or more generally (j - i + 1) * MinHeight(i, j), where MinHeight(i, j) is the minimum height of columns between i and j.

If maxRect represents the rectangle with the maximum area, h(i) the height of column i, then one can write an algorithm that looks something like:

maxRect = max{h(i)}     # initialize = tallest column

for i = 1 to N-1

  minHeight(i) = h(i)   # between i and current "j"

  for j = i+1 to N

    minHeight(i) = min(h(j), minHeight(i))
    area         = (j - i + 1) * minHeight(i)

    if(area > maxRect.area)
      maxRect = Rect(i, j)   # rectangle(i,j)
    endif

  endfor

endfor

This is an order O(N^2) algorithm, which is not terribly impressive. We can do much better, at least O(N log N).

However, before we write this off as a bad algorithm, I want to point out three redeeming features.

One, its memory footprint is small. Two, the algorithm is eminently "parallelizable". And three, it is simple to understand and maintain.

So in a world with different constraints, it may still have a chance.

Thursday, January 6, 2011

Does that place have free WiFi?

I found this nice site, which tells you where you can find (airports, restaurants etc.) free WiFi. Found is a somewhat misleading word in this context, since it shows up as the top hit when you google "free wifi" :).

In any case, this list of free WiFi at airports is particularly useful, while traveling.

Here's a list of free local spots in Tallahassee.

All Saints Cafe - 903 Railroad Avenue - (850) 224 0805
All Tune and Lube - 2764 W Tennessee St - 850 574 1900
Atlanta Bread Co. - Governor's Square Blvd.
BagelHeads - 1935 Apalachee Pkwy - (850) 574-1122
Bagel Bagel Cafe - 2030 Apalachee Parkway - (850) 877-0111
Barnacle Bill's Seafood Rest. - 1830 N Monroe St - 850 385 8734
Betton Hills Coffee & Tea - 2086-B Thomasville Rd. - 850-386-9800
Buffalo Wild Wings - 1921 W Tennessee St - 850 425 5293
Calico Jack's Seafood House - 2745 Capital Circle NE - 850 385 6653
Cool Grindz Coffee - 1700 N. Monroe St. #17 - (850) 561-3600
Crispers Fresh Salads & Such - 1241 Apalachee Parkway - (850) 656-4222
Qdoba Mexican Grill - 1350 W Tennessee St. - 850-222-3334
Qdoba Mexican Grill - 1594 Governors Square Blvd. - (850) 671-3334
Tallahassee Memorial HealthCare - 1300 Miccousukee Rd
The Bada Bean, Tea & Coffee Co. - 2500 Apalachee Pkwy, #B - 850 562 2326
The Coffee Pub - 1122 Thomasville Rd - 850-425-5701
Tropical Smoothie Cafe - 3111 Mahan Drive, Ste 23 - 850 878 3777
Village Inn - 2531 Apalachee Pkwy - 850 877 8471
Village Inn - 3392 Lonnbladh Rd - 850 297 0053
Fast Trixx Powersports - 2386 Allen Road - 850-580-3278
Strozier Library FSU Campus - 23 Dogwood Way 850 644 5685
Tallahassee Progressive Center - 1720 S. Gadsden Street
Leroy Collins Leon County Public Library, Parkway Branch - 1210 Capital Cir Se - (850) 487-1926
YogaBerry - 800 Ocala Rd, Ste 150 - 850 575 2480

Histogram Puzzle

I came across an interesting "algorithmic" puzzle, through a colleague of my wife.

Suppose you have a histogram composed of "square tiles" as shown below:
There are N columns (N = 8 in this illustration), and each column has an integral number of tiles. The tallest column has P tiles (P = 9 in this illustration).

The problem is to find a rectangle that is inscribed within the histogram, that has the largest area. In the figure above, this particular rectangle has dimensions 6*2, and an area of 12. Note that there may more than one unique solution. In the figure above, there is another rectangle with dimensions 3*4 (area = 12), which is not outlined.

We want to design an algorithm that enables us to identify all the different inscribed rectangles with maximum area.

Saturday, January 1, 2011

Why do airplanes use zones to board passengers?

I just got back from a trip to San Diego. One of the (very few) great things about traveling with an infant, is that you get to board right after first and business class passengers are through. Even so, I made a mental note to try and learn something about boarding zones, after I got back.

Boarding an airline turns out to be a nontrivial optimization problem.

Clearly, an appropriate function to be minimized is total boarding time (with special consideration for "special" passengers, as a constraint). Key variables are the number and distribution of seats on the aircraft, the fraction of total seats occupied, and the stochastic nature of the boarding process itself.

It turns out that there is no clear winning formula, although some strategies are better than others. Here is a nice description, which animates some common strategies, and links to some prominent primary sources. An interesting picture from simulation studies (from that site):


Two interesting points. If the passenger inter-arrival time is more than 10 seconds, all the strategies seem to coalesce. Secondly, although reverse-pyramid is one of the more efficient choices, no commercial airline seems to use it. Instead, many airlines seem to prefer back-to-front which is among the more inefficient methods.

Why? I don't know the "real" answer, but it seems to me that people traveling together in groups typically choose adjacent seats. Reverse-pyramid or outside-in (United Airlines) would split them up.

Should modelers incorporate this criterion into their optimization models?