Monday, August 13, 2012

TeXcount: Number of words in a LaTeX document

Since TeX is really a markup language, counting the number of words in a document is tricky. Obviously, you don't want to count the commands themselves, such as \chapter{}, \begin{center}, or \cite{reference1, reference2}.

You may have macros, which need interpretation.

You may have external files that you are collecting together in a master document by using \input{} etc.

In short, it is not as simple as it seems.

You could try front-end programs like Kile or TeXShop, which will give you a simple total count. My front-end program of choice, Texmaker, does not do it for me.

If your document is very simple, you could strip the LaTeX markup with "detex" and count what is left with a simple Linux utility like "wc".
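
To get a feel for why this only works on simple documents, here is a rough Python sketch of the same strip-then-count idea. It is not detex and does not reproduce its behaviour; the function name rough_word_count and the short list of "non-prose" commands are my own assumptions, chosen just for illustration.

```python
import re

def rough_word_count(tex: str) -> int:
    """Crude word count for very simple LaTeX (a sketch, not detex or TeXcount)."""
    # Drop comments: an unescaped % up to the end of the line.
    tex = re.sub(r"(?<!\\)%.*", "", tex)
    # Drop commands whose arguments are not prose (assumed, incomplete list).
    tex = re.sub(
        r"\\(?:begin|end|cite|ref|label|input|include)\*?(?:\[[^\]]*\])?\{[^}]*\}",
        " ",
        tex,
    )
    # Drop remaining command names but keep their brace contents,
    # so the title in \chapter{Title} still counts.
    tex = re.sub(r"\\[a-zA-Z]+\*?", " ", tex)
    # Remove braces, then count tokens that contain a letter or digit.
    tex = re.sub(r"[{}]", " ", tex)
    return sum(1 for tok in tex.split() if re.search(r"[A-Za-z0-9]", tok))

sample = r"""
\chapter{Introduction}
\begin{center}
Polymers are long molecules \cite{reference1, reference2}. % a comment
\end{center}
"""

naive = len(sample.split())        # counts markup and comments as words
rough = rough_word_count(sample)   # counts only the visible prose
print(naive, rough)
```

On this sample the naive split reports 12 "words" while the stripped count reports 5. Macros, math, and files pulled in with \input{} are all ignored here, which is exactly where a dedicated tool earns its keep.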

The best solution seems to be TeXcount.

There is a web interface that lets you paste your TeX document into a web form.

Alternatively, you can download the script. It is essentially a small Perl program called texcount.pl (the download is about 400 kB in all, and the script itself is about 90 kB), which you can run quite simply as

perl texcount.pl filename.tex

Here is the default output it produces:


Encoding: ascii
Words in text: 10324
Words in headers: 81
Words in float captions: 219
Number of headers: 30
Number of floats: 5
Number of math inlines: 198
Number of math displayed: 18
Subcounts:
  text+headers+captions (#headers/#floats/#inlines/#displayed)
  14+9+0 (1/0/0/0) _top_
  89+1+0 (1/0/0/0) Section: Introduction
  419+2+44 (1/1/0/0) Subsection: Analytical Rheology
  646+2+38 (2/1/3/0) Subsection: Polymers
  236+3+0 (1/0/0/0) Subsection: Scope and Organization
  35+3+0 (1/0/0/0) Section: Motivation and Background
  355+2+19 (1/0/2/0) Subsection: Linear Polymers
  657+2+24 (1/1/17/1) Subsection: Branched Polymers
  205+3+45 (1/1/0/0) Subsection: Model-driven Analytical Rheology
  97+6+0 (1/0/0/0) Section: Models for Polymer Dynamics and Rheology
  597+2+0 (1/0/2/0) Subsection: Historical Development
  872+5+25 (2/1/8/0) Subsection: The Tube Model
  211+4+0 (1/0/0/0) Subsection: State of the Art
  273+2+0 (1/0/5/0) Subsection: Computational Models
  162+3+0 (1/0/0/0) Section: Methods and Progress
  1774+5+0 (3/0/108/15) Subsection: Linear Polymers
  2955+24+24 (9/0/51/2) Subsection: Branched Polymers
  727+3+0 (1/0/2/0) Section: Summary and Perspective
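
If you want to do something with these numbers, the subcount lines are regular enough to parse. The following is a minimal Python sketch that assumes the exact line layout shown above (text+headers+captions, then the four counts in parentheses, then the title); the regular expression and the parse_subcounts helper are my own, not part of TeXcount.

```python
import re

# Assumed layout of one line: "419+2+44 (1/1/0/0) Subsection: Analytical Rheology"
SUBCOUNT = re.compile(
    r"^\s*(\d+)\+(\d+)\+(\d+)\s+\((\d+)/(\d+)/(\d+)/(\d+)\)\s+(.*)$"
)

def parse_subcounts(report: str):
    """Return (title, text, headers, captions) for each subcount line."""
    rows = []
    for line in report.splitlines():
        m = SUBCOUNT.match(line)
        if m:  # summary and legend lines do not match and are skipped
            text, headers, captions = (int(m.group(i)) for i in (1, 2, 3))
            rows.append((m.group(8).strip(), text, headers, captions))
    return rows

report = """\
Words in text: 10324
Subcounts:
  text+headers+captions (#headers/#floats/#inlines/#displayed)
  14+9+0 (1/0/0/0) _top_
  89+1+0 (1/0/0/0) Section: Introduction
  419+2+44 (1/1/0/0) Subsection: Analytical Rheology
  646+2+38 (2/1/3/0) Subsection: Polymers
  236+3+0 (1/0/0/0) Subsection: Scope and Organization
  35+3+0 (1/0/0/0) Section: Motivation and Background
  355+2+19 (1/0/2/0) Subsection: Linear Polymers
  657+2+24 (1/1/17/1) Subsection: Branched Polymers
  205+3+45 (1/1/0/0) Subsection: Model-driven Analytical Rheology
  97+6+0 (1/0/0/0) Section: Models for Polymer Dynamics and Rheology
  597+2+0 (1/0/2/0) Subsection: Historical Development
  872+5+25 (2/1/8/0) Subsection: The Tube Model
  211+4+0 (1/0/0/0) Subsection: State of the Art
  273+2+0 (1/0/5/0) Subsection: Computational Models
  162+3+0 (1/0/0/0) Section: Methods and Progress
  1774+5+0 (3/0/108/15) Subsection: Linear Polymers
  2955+24+24 (9/0/51/2) Subsection: Branched Polymers
  727+3+0 (1/0/2/0) Section: Summary and Perspective
"""

rows = parse_subcounts(report)
total_text = sum(r[1] for r in rows)
total_headers = sum(r[2] for r in rows)
total_captions = sum(r[3] for r in rows)
longest = max(rows, key=lambda r: r[1])

print(total_text, total_headers, total_captions)
print("Longest part:", longest[0], longest[1])
```

The per-section figures add up to the totals at the top of the report (10324 words in text, 81 in headers, 219 in float captions), and the longest part is the second "Branched Polymers" subsection with 2955 words.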


You can exercise significant control over the way it parses the document and reports the results by using options that are described in the manual.
