Wednesday, July 18, 2012

Splitting a large text file by number of lines and tags

Say you have a big file (text, picture, movie, etc.) and you want to split it into many small parts. You may want to do this to email it to someone in more manageable chunks, or to analyze it using a program that cannot handle all of the data at once.

The Linux command split lets you chop your file into chunks of a specified size. To break a big file called "BigFile.mpg" into multiple smaller chunks "chunkaa", "chunkab", etc., you say something like:

split -b 10M BigFile.mpg chunk
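
Since split names the pieces so that they sort in the right order, you can later stitch the original back together with something like:

cat chunk?? > BigFile.mpg

A quick cmp against the original file confirms that nothing was lost along the way.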

Consider a simpler case, where the big file is a text file. For concreteness assume that BigFile.txt looks like:

# t = 0
particle1-coordinates
particle2-coordinates
...
particleN-coordinates

# t = 1
particle1-coordinates
particle2-coordinates
...
particleN-coordinates
...
# t = tfinal
particle1-coordinates
particle2-coordinates
...
particleN-coordinates

You might generate a file like this if you are running a particle-based simulation like MD and periodically printing out the coordinates of the N particles in your system. For concreteness, say N = 1000 and tfinal = 500.

If this file were too big and you wanted to split it into multiple files (one file per time snapshot), you could still use the split command as follows:

split -l 1002 BigFile.txt chunks

The 1002 accounts for the 1000 coordinate lines plus the two additional lines in each block: the time stamp and the blank line after the snapshot.
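
As a quick sanity check, you can print the first line of each piece (with the prefix "chunks", split names them chunksaa, chunksab, and so on); every one should be a time-stamp line:

head -n 1 chunks??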

You can also use awk instead, exploiting the fact that the "#" tag demarcates records:

awk 'BEGIN {c = 0} /^#/ {if (f) close(f); f = c ".dat"; c++; next} NF {print > f}' BigFile.txt

would do something very similar. It matches the "#" tag and creates files 0.dat, 1.dat, etc., each containing one time snapshot; the close() call keeps awk from accumulating hundreds of open file handles, and the NF test skips the blank lines. The advantage of this method is that you have more flexibility in naming your chopped pieces, and you don't have to know the value of "N" beforehand.
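
For example, if you would rather name each piece after the actual time stamp than a running counter, you can pull the value out of the header line (a sketch, assuming the header really has the form "# t = value", so the value is the fourth whitespace-separated field):

awk '/^#/ {if (f) close(f); f = "t" $4 ".dat"; next} NF {print > f}' BigFile.txt

This produces t0.dat, t1.dat, and so on up to t500.dat.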

Finally, say you wanted to create the chopped pieces in a different way. Instead of chopping up time snapshots, you want to store the trajectory of each individual particle in a separate file. So while the methods above created 501 files (t = 0 through t = 500) with 1000 (+2) lines each, you now want to create 1000 files with 501 lines each. One of the easiest ways is to use sed.

sed -n '1~10p' prints every tenth line, starting with line 1. (The first~step address form is a GNU sed extension, so this requires GNU sed.)
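
For a quick demonstration,

seq 20 | sed -n '1~10p'

prints lines 1 and 11. You can use this to write a simple shell script that pulls out one particle's trajectory per pass: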

npart=1000
ndiff=$((npart + 2))
n=1
while [ $n -le $npart ]
do
  nstart=$((n + 1))
  # particle n first appears on line n+1, and again every ndiff lines
  sed -n "${nstart}~${ndiff}p" BigFile.txt > $n.dat
  n=$((n + 1))
done

Note the double quotes around the sed address: they let the shell expand $nstart and $ndiff before sed sees the expression.
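
This loop rereads the big file once per particle, which can get slow. A single-pass alternative in awk (a sketch; the particle file names are just an illustration, and GNU awk is assumed since it manages the 1000 simultaneously open output files):

awk '/^#/ {i = 0; next} NF {i++; print > ("particle" i ".dat")}' BigFile.txt

Each coordinate line is appended to the file for its particle, so particle1.dat through particle1000.dat each end up with one line per snapshot.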
