Friday, April 30, 2010

A sed one-liner for LaTeX

A while back, I posted my regex cheat sheet. "sed" is one of the tools that puts regular expressions to work. For those not in the know:
sed is a stream editor. A stream editor is used to perform basic text transformations on an input stream (a file or input from a pipeline). While in some ways similar to an editor which permits scripted edits (such as ed), sed works by making only one pass over the input(s), and is consequently more efficient. But it is sed's ability to filter text in a pipeline which particularly distinguishes it from other types of editors.
But that doesn't really tell you much. Here is an excellent tutorial by Bruce Barnett. Here are explanations of some great one-liners.

Today, I had to do something to a LaTeX manuscript I have been working on for a few months: I finally made up my mind about which journal to send it to. While LaTeX in conjunction with natbib and BibTeX is great for re-formatting, there was one issue that needed to be weeded out.

I wrote the manuscript using Nature citation style (compressed numerical superscripts), placing \cite commands after punctuation marks like commas and periods. The new place I wanted to send it to required author-year formatting before punctuation marks. Basically, I needed to transform something like
This is known from previous studies.1-3
to,
This is known from previous studies (Smart and Boss, 1990; Smarter et al., 2005; Smartass, 2011).

Normally, this would be a fairly tedious manual exercise. But sed reduces the task to a one-liner.


sed 's/\([,.;:]\)\\cite{\([A-Za-z0-9, ]*\)}/ \\citep{\2}\1/g' doc0.tex > doc1.tex

Let me parse the command.

sed 's/xxx/yyy/g' doc0.tex > doc1.tex

substitutes (the s/) all matches (the /g at the end, for global substitution) of "xxx" in the file doc0.tex with "yyy" and writes the result to a new file, doc1.tex. So far it is a simple find and replace, which could have been done easily in most editors.
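To see what the g flag changes, here is a small sketch on a throwaway string (the input text and variable names are made up for illustration):

```shell
# Without g, only the first match on each line is replaced.
first_only=$(echo "cite cite cite" | sed 's/cite/citep/')

# With g, every match on the line is replaced.
all_matches=$(echo "cite cite cite" | sed 's/cite/citep/g')

echo "$first_only"    # citep cite cite
echo "$all_matches"   # citep citep citep
```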

The real magic is in the "xxx" and "yyy" of the original command. The [,.;:] matches any of the usual punctuation marks that precede a \cite declaration. The escaped parentheses \( and \) mark stuff to be remembered, and \1, \2, etc. puke it out. Thus,
echo abcd123 | sed 's/\([a-z][a-z]*\).*/\1/'
abcd

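The \\ is required to protect the \, which otherwise has special meaning in a regex. In the one-liner, \1 holds the punctuation mark and \2 holds the citation keys, so writing \\citep{\2} ahead of \1 is what moves the citation in front of the punctuation. The rest is plain old regex matching.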
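A quick way to check the one-liner before running it on a whole manuscript is to feed it a single line. This is only a sketch: the sample sentence and the citation keys Smart1990 and Smarter2005 are invented for the test.

```shell
# Hypothetical sample line; the citation keys are made up.
sample='This is known from previous studies.\cite{Smart1990, Smarter2005}'

# Same substitution as the one-liner above, applied to one line from stdin.
converted=$(printf '%s\n' "$sample" |
  sed 's/\([,.;:]\)\\cite{\([A-Za-z0-9, ]*\)}/ \\citep{\2}\1/g')

printf '%s\n' "$converted"
# This is known from previous studies \citep{Smart1990, Smarter2005}.
```

Note that the bracket expression [A-Za-z0-9, ] only covers letters, digits, commas and spaces. Citation keys containing other characters, such as an underscore, colon or hyphen, would be left untouched unless those characters are added to the bracket.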
