Sunday, February 27, 2011

How to work on the Nth column of a file with awk?

Consider the following awk script called stat.awk:
#
# AWK SCRIPT stat.awk does statistics...
# prints out arithmetic mean, standard deviation,
# standard error of column 1
#

BEGIN { n = 0; s = 0; ss = 0; }

{
  n++;
  s += $1;
  ss += $1 * $1;
}
END {
   print n " data points"
   m = (s+0.0)/n; print m " average"
   sd = sqrt( (ss - n * m * m) / ( n - 1.0))
   print sd " standard deviation"
   se = sd/sqrt(n)
   print se " standard error"
}



$ awk -f stat.awk data.txt


This prints out the statistics of column 1 of the data file.
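For instance, on a hypothetical data.txt containing the numbers 1 through 5 in a single column, the run would print:

5 data points
3 average
1.58114 standard deviation
0.707107 standard error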

What if we wanted to find the statistics of an arbitrary column ColumnNum (especially if you want to wrap it in a bash script for repeated operation)? It is reasonably straightforward to modify the script above into stat_general.awk:

#
# AWK SCRIPT stat_general.awk does statistics... 
# prints out arithmetic mean, standard deviation,
# standard error of an arbitrary column
#

BEGIN { n = 0; s = 0; ss = 0; }

{
  n++;
  s += $ColumnNum;
  ss += $ColumnNum * $ColumnNum;
}

END {
   print n " data points"
   m = (s+0.0)/n; print m " average"
   sd = sqrt( (ss - n * m * m) / ( n - 1.0))
   print sd " standard deviation"
   se = sd/sqrt(n)
   print se " standard error"
}


$ x=5
$ awk -v ColumnNum=$x -f stat_general.awk data.txt

Presto!
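And here is a minimal sketch of the kind of bash wrapper alluded to above (the file name and column range are placeholders):

#!/bin/bash
# report statistics for each of the first three columns of data.txt
for col in 1 2 3; do
    echo "column $col:"
    awk -v ColumnNum=$col -f stat_general.awk data.txt
done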

Thursday, February 24, 2011

ALGLIB: New Kid on the Block?

I just started using ALGLIB for a somewhat large optimization problem, and so far, I am loving it. From the wikipedia entry:
ALGLIB is a cross-platform open source numerical analysis and data processing library. It is written in specially designed pseudocode which is automatically translated into several target programming languages (C++, C# and others). ALGLIB is a relatively young project; active development started only in 2008, while GSL, for example, has a 14-year history. However, it is actively developed, with new releases every 1–2 months.
I use the C++ routines, since I got tired of using plain C with the older but reliable GSL.

ALGLIB has more than just an interesting distribution mechanism. Unlike traditional numerical libraries, you "include" the ALGLIB routines you need directly in your project, making installation a non-issue.
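As a rough sketch, assuming you have unpacked the C++ sources into alglib/src (the paths and names here are placeholders from my setup), building a program is just a matter of compiling the ALGLIB sources alongside your own code:

# no configure/make/install step; compile the ALGLIB sources
# together with your own program
g++ -O2 -I alglib/src myprog.cpp alglib/src/*.cpp -o myprog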

So far most of my use has been confined to a couple of packages, but I like what I see so far.

Saturday, February 19, 2011

CiteULike: Making Life Simpler

I've been using CiteULike exclusively to keep my bibliography straight for almost four years now. If I had to make an ordered list of the most important web utilities (to me), it would probably float near the top.

It really has simplified a recurring feature of my life.

Here is a YouTube video that showcases some of its features.

Personally, I like it for the following reasons:
  1. you can easily add a new reference to your library
  2. you can tag, classify, and search for a paper in your library very quickly
  3. you can add personal notes to a paper
  4. you can upload a personal pdf of the paper to go with the reference
  5. you can upload an annotated pdf (multiple pdfs are allowed), for papers you keep going back to again and again
  6. you can export the citation in BibTeX format
  7. you can share papers with your group, or collaborators
  8. you can use the "recommended papers" feature (I have not used this much)
  9. it is free
Since one can archive papers online, they can easily be accessed from anywhere, and from any machine. For me, it has completely supplanted JabRef, which is a darn good program itself.

Wednesday, February 16, 2011

How is that journal title abbreviated?

Every once in a while you have to cite a paper from an unfamiliar journal. It can be frustrating not knowing how the journal title is abbreviated, although the internet has reduced that pain somewhat.

You can find how ISI abbreviates all the journals it indexes here. You can browse according to first letter, and use the search feature in your browser to find what you are looking for.

Recently, for example, the list helped me figure out that "IEEE Transactions on Visualization and Computer Graphics" was abbreviated as "IEEE T. Vis. Comput. Gr."



Friday, February 11, 2011

Big Brother is Watching?

My little town of Tallahassee now has video cameras installed at many major intersections. A nearly-live feed (refreshed every two minutes) is publicly available.

I must admit that I found this creepy at first. But it has had a practical benefit.

Monitoring traffic at a particularly nasty intersection (Mahan-Capital Circle) on my way back home from work helps me plan when to leave.

Tuesday, February 8, 2011

Tokenize bash variables

Here is a useful set of bash features that I had to use today.

From the source linked above:

Given foo=/tmp/my.dir/filename.tar.gz

We can use bash expressions to tokenize or extract different portions of the variable.

path=${foo%/*}      # /tmp/my.dir
file=${foo##*/}     # filename.tar.gz
base=${file%%.*}    # filename
ext=${file#*.}      # tar.gz

This gives us four combinations for trimming patterns off the beginning or end of a string:
${variable%pattern}: Trim the shortest match from the end
${variable##pattern}: Trim the longest match from the beginning
${variable%%pattern}: Trim the longest match from the end
${variable#pattern}: Trim the shortest match from the beginning
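
Putting it all together, here is a quick sanity check you can paste into a terminal:

#!/bin/bash
# demonstrate the four trimming operators on a sample path
foo=/tmp/my.dir/filename.tar.gz
file=${foo##*/}
echo "path: ${foo%/*}"     # path: /tmp/my.dir
echo "file: $file"         # file: filename.tar.gz
echo "base: ${file%%.*}"   # base: filename
echo "ext:  ${file#*.}"    # ext:  tar.gz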

Saturday, February 5, 2011

More links

1. Tall, Grande, Venti, and now Starbucks presents Trenta (via FlowingData). Super-duper-size me!

2. Complex conjugates are funny! I could actually visualize one of my old teachers doing something like this :).

3. Timing is everything in the stock market (NYT, again via FlowingData)? I wonder why I've never seen a lucid graphic like this before.

Wednesday, February 2, 2011

Engauge Digitizer: Extract Freely

Engauge Digitizer is an "open source, digitizing software which converts an image file showing a graph or map, into numbers. The image file can come from a scanner, digital camera or screenshot." If you are using Linux and have a pdf of the figure you want to digitize, you can capture it from the command line using ImageMagick's import command

import pic.png

and selecting the picture bounds with the mouse. Then:
  1. Start Engauge.
  2. "Import" the PNG picture.
  3. Define the axes (log or linear) using three points.
  4. Scan in points or lines, either manually or automatically.

Once you have selected all the points you need, you can "export" the file into a "data.csv" file.

Typically this file looks something like:

x,Curve5
21.4764,59.9803
26.2048,92.1999
31.9743,163.566
45.5445,319.264
58.0845,527.22
...

To remove the first line, replace the commas with spaces/tabs, and do other superficial dressing up, we use a simple terminal-based command:

more +2 data.csv | awk -F, '{printf("%e\t%e\n",$1,$2)}' > data.dat

The +2 skips the first line, and the -F, specifies that the field separator is a comma. This immediately transforms data.csv into a new file, data.dat, which looks like:

2.147640e+01    5.998030e+01
2.620480e+01    9.219990e+01
3.197430e+01    1.635660e+02
4.554450e+01    3.192640e+02
5.808450e+01    5.272200e+02

...

Done!

This is especially useful if you are extracting a bunch of data sets from the same graph and then cleaning them all up at once, as in the sketch below.
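If the exported files are named systematically (curve1.csv, curve2.csv, and so on; the names here are hypothetical), a small loop handles them all in one go. I use tail -n +2, which does the same job as the more +2 above but is better suited to scripts:

#!/bin/bash
# convert every exported Engauge CSV into a whitespace-separated .dat file
for f in curve*.csv; do
    tail -n +2 "$f" | awk -F, '{printf("%e\t%e\n",$1,$2)}' > "${f%.csv}.dat"
done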

PS: I used to use a Mac-based program called DataThief for similar tasks. I notice that it has since become shareware.


Tuesday, February 1, 2011

Gray and Dirty Swans!

My opinion of Nassim Nicholas Taleb's "The Black Swan" is not particularly charitable. But you know, this guy put his philosophy into practice by running a hedge fund and making billions of dollars for himself and his clients. What do you say about that, huh?

I did not know this while I was reading the book, but it turns out that the reported magnitude of his exploits was "somewhat" misleading.

NNT's investment strategy is, very crudely speaking, the opposite of the insurance business. That is, you continuously bleed money, with the expectation of making monster-sized gains when an extremely low-probability event (a Black Swan) occurs. The rationale behind the strategy is that the probabilities "we" assign to low-probability events (which could be zero) are usually lower than warranted. That is, there is an asymmetry between expected and empirical probabilities.

As Janet Tavakoli puts it:
The black swan fund's strategy is purportedly to buy out-of-the-money put options on stocks and broad market indices and hedge tail risk for clients. The strategy may produce long periods of mediocre--or even negative--returns followed by a large gain and vice versa. No one can tell you for certain exactly when (or for how long) large gains are possible.
She reports here that:
Taleb’s Empirica Kurtosis “black swan” fund had negative returns in 2001, the year of the 9/11 black swan event. Taleb later claimed he only called it a hedge fund “in May-Oct 2001.” Perhaps he meant something else, because Empirica Kurtosis wound up at the beginning of 2005 with lackluster returns, and performance specifics are not public, but it may have been a stranded swan.
She also discovered that, when he was quoted as saying the following (in the aftermath of the present crisis) in a GQ article fawning over his intellectual prowess:
I went for the jugular--we went for the max. I was interested in screwing these people--I'm not interested in money, but I wanted to teach them a lesson, and the only way you can do it is by trying to take it away from them. We didn't short the banks--there's not much to be gained there, these were all these complex instruments, options and so forth. We'd been building our positions for a while...when they went to the wall we made $20 bln for our clients, half a billion for the Black Swan fund.
he was actually being borderline deceitful. Upon being pressed, NNT admitted that he had really made about 250-500 million dollars, and not the 20 billion dollars of his notional exposure.

Sure, what's a factor of 40 or 80 between friends?

But then Jim Rogers caught him on that as well. Making $0.25-0.5 billion on a $20 billion exposure is a 1.25-2.5% return, not the super-sized return you would expect for all the waiting and bleeding you've been doing.

In the linked articles, Janet Tavakoli also attacks his claim of having been one of the first to foresee the present crisis. What does she get in return? This hilarious stuff! He puts a big yellow post-it note on the GQ story as posted on his webpage:
Note that NUMBERS are wrong. This is not a business/finance, but a philosophy article written by Will Self. So read the article for its ideas. Janet Tavakoli used the errors as a platform for her (failed) smear campaign.

I have very, very stupid enemies.
As Tavakoli says, "he plays the victim and resorts to unwarranted name calling when asked legitimate questions."