Monday, March 7, 2011

CiteULike: Export and clean BibTeX file

As mentioned in a previous post, CiteULike lets you export citations in BibTeX format, which is useful for including them in LaTeX documents. However, the BibTeX entries it produces contain a lot of metadata that I like to filter out.

One could do this manually, which is fine, but many of the cleaning operations are routine. As any self-respecting Linux user would say, it would be nice to automate at least part of the cleanup.

For example, a typical BibTeX entry that CiteULike produces looks like:

@article{newman01,
    abstract = {{We describe in detail an efficient algorithm for studying site or bond percolation on any lattice. The algorithm can measure an observable quantity in a percolation system for all values of the site or bond occupation probability from zero to one in an amount of time that scales linearly with the size of the system. We demonstrate our algorithm by using it to investigate a number of issues in percolation theory, including the position of the percolation transition for site percolation on the square lattice, the stretched exponential behavior of spanning probabilities away from the critical point, and the size of the giant component for site percolation on random graphs.}},
    archivePrefix = {arXiv},
    author = {Newman, M. E. J. and Ziff, R. M.},
    citeulike-article-id = {3373958},
    citeulike-linkout-0 = {http://arxiv.org/abs/cond-mat/0101295},
    citeulike-linkout-1 = {http://arxiv.org/pdf/cond-mat/0101295},
    citeulike-linkout-2 = {http://dx.doi.org/10.1103/PhysRevE.64.016706},
    citeulike-linkout-3 = {http://link.aps.org/abstract/PRE/v64/i1/e016706},
    citeulike-linkout-4 = {http://link.aps.org/pdf/PRE/v64/i1/e016706},
    day = {8},
    doi = {10.1103/PhysRevE.64.016706},
    eprint = {cond-mat/0101295},
    journal = {Physical Review E},
    keywords = {cluster, fast, percolation},
    month = {Jun},
    number = {1},
    pages = {016706+},
    posted-at = {2011-01-12 21:22:08},
    priority = {2},
    publisher = {American Physical Society},
    title = {{Fast Monte Carlo algorithm for site or bond percolation}},
    url = {http://dx.doi.org/10.1103/PhysRevE.64.016706},
    volume = {64},
    year = {2001}
}

Typically, I like to get rid of all the "citeulike" fields and other metadata I do not need, such as priority, abstract, etc. Additionally, I like to abbreviate journal titles and, where necessary, replace authors' first names with initials.

That is quite a bit.

I wrote a quick and dirty "sed" script, called "clean_citeulike", shown below:

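# Abbreviate journal names.
s/Physical Review/Phys. Rev./
s/Journal of Rheology/J. Rheol./
s/Rheologica Acta/Rheol. Acta/
s/The Journal of Chemical Physics/J. Chem. Phys./
s/Computer Physics Communications/Comp. Phys. Comm./
s/Macromolecular Theory Simulation/Macromol. Theory Simul./

# On the author line, reduce a first name that follows "Surname, " to its initial.
/^[[:space:]]*author[[:space:]]*=/ s/, \([A-Z]\)[a-z]* /, \1. /g
/^[[:space:]]*author[[:space:]]*=/ s/, \([A-Z]\)[a-z]*}/, \1.}/g

# Delete unwanted fields. Each pattern is anchored to the field name so that,
# for example, a title containing "day" or "url" is not deleted by accident.
/^[[:space:]]*citeulike-/ d
/^[[:space:]]*keywords[[:space:]]*=/ d
/^[[:space:]]*posted-at[[:space:]]*=/ d
/^[[:space:]]*priority[[:space:]]*=/ d
/^[[:space:]]*publisher[[:space:]]*=/ d
/^[[:space:]]*abstract[[:space:]]*=/ d
/^[[:space:]]*month[[:space:]]*=/ d
/^[[:space:]]*url[[:space:]]*=/ d
/^[[:space:]]*day[[:space:]]*=/ d
/^[[:space:]]*issn[[:space:]]*=/ d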
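
The sample entry already uses initials, so the author rules have nothing to do there. To see them in action, here is a small sketch that applies the same two substitutions to a made-up author line with full first names. The function name abbreviate_authors and the sample line are my own, for illustration only.

```shell
# abbreviate_authors is a hypothetical helper that wraps the two author
# substitutions from the script; it reads BibTeX lines on standard input.
abbreviate_authors() {
    sed -e 's/, \([A-Z]\)[a-z]* /, \1. /g' \
        -e 's/, \([A-Z]\)[a-z]*}/, \1.}/g'
}

# Made-up author line with first names spelled out.
printf '%s\n' 'author = {Newman, Mark and Ziff, Robert},' | abbreviate_authors
# prints: author = {Newman, M. and Ziff, R.},
```

Note that only the name directly after "Surname, " is shortened. A middle name written in full is left alone, which is one of the things that may still need fixing by hand.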

Now all I need to do is run sed with this script on the "bib" file (which I call in.bib):

$ sed -f clean_citeulike in.bib

and I get something that looks like:

@article{newman01,
    archivePrefix = {arXiv},
    author = {Newman, M. E. J. and Ziff, R. M.},
    doi = {10.1103/PhysRevE.64.016706},
    eprint = {cond-mat/0101295},
    journal = {Phys. Rev. E},
    number = {1},
    pages = {016706+},
    title = {{Fast Monte Carlo algorithm for site or bond percolation}},
    volume = {64},
    year = {2001}
}

Not perfect, perhaps, but what is left is much easier to clean up by hand.
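
The command above only prints the cleaned entries to the terminal. A minimal sketch of how one might save them to a file and check that no entry went missing is given below. The helper name clean_bib and the output file name out.bib are assumptions of mine, not part of the original workflow.

```shell
# clean_bib is a hypothetical helper:
#   $1 = sed script, $2 = input .bib file, $3 = output .bib file
clean_bib() {
    # Write the cleaned entries to a new file; the input is left untouched.
    sed -f "$1" "$2" > "$3" || return 1
    # Deleting fields should never delete a whole entry, so the number of
    # lines starting with "@" must be the same before and after.
    [ "$(grep -c '^@' "$2")" -eq "$(grep -c '^@' "$3")" ]
}

# Example use, with the file names from this post:
#   clean_bib clean_citeulike in.bib out.bib && echo "entry count unchanged"
```

Writing to a separate file keeps the original export around, so the two can be compared with diff if a rule turns out to remove too much.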
