Thursday, July 9, 2009

Combining data from independent simulation runs using a bash script

Today I came across a problem that I have solved several times before. My simulations generate a bunch of files called stat1, stat2, ... statN, each of which contains data like this:

$cat stat1
567.20 0.88
45.29 3.08
296.58 21.50
0.33 0.14

The first column holds the properties measured in a particular simulation run, and the second column holds the corresponding standard errors. The N different "stat" files come from N independent simulation runs. In the final report, I like to give the average properties and their associated standard errors. The following shell script, DataAgg.sh, creates a new file, TotalProp, which contains exactly that.

$cat TotalProp
567.49 0.24
43.57 0.45
289.91 1.61
0.67 0.10
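
In formula form, the averaging described above can be sketched as follows. This assumes the N runs are independent and equally weighted; the symbols x_i and \sigma_i (the value and standard error of a given property in run i) are notation introduced here for illustration only.

```latex
\bar{x} = \frac{1}{N} \sum_{i=1}^{N} x_i ,
\qquad
\sigma_{\bar{x}} = \frac{1}{N} \sqrt{\sum_{i=1}^{N} \sigma_i^{2}}
```

Each row of TotalProp is meant to hold one such pair: the mean in the first column and its standard error in the second.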

The shell script is here:

$cat DataAgg.sh

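#!/bin/bash
# Average column 1 and combine the column-2 standard errors over all stat* files.

i=0
for s in stat*
do
    i=$((i+1))

    if [ "$i" -eq 1 ]; then
        # First file: initialise the running sums.
        awk '{print $1}' "$s" > TmpProp
        awk '{print $2*$2}' "$s" > TmpErr2Prop
    else
        # Add this run's property values to the running sum.
        awk '{print $1}' "$s" > tmp
        paste tmp TmpProp > more
        awk '{print $1+$2}' more > TmpProp

        # Add this run's squared standard errors to the running sum.
        awk '{print $2*$2}' "$s" > tmp
        paste tmp TmpErr2Prop > more
        awk '{print $1+$2}' more > TmpErr2Prop
    fi
done

# Mean of each property, and the standard error of that mean: sqrt(sum of squares)/n.
awk '{print $1/n}' n="$i" TmpProp > more; mv more TmpProp
awk '{print sqrt($1)/n}' n="$i" TmpErr2Prop > more; mv more TmpErr2Prop
paste TmpProp TmpErr2Prop > more
awk '{printf("%6.2f\t%6.2f\n", $1, $2)}' more > TotalProp

# Clean up the temporary files.
rm -f TmpProp TmpErr2Prop more tmp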

Note that I don't need to know how many "stat" files there are, or how many rows each of them has. The only preconditions are that I know the common prefix ("stat") of my data files, that every file has the same number of rows in the same order, and that the files contain only the two numerical columns mentioned above.
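
For comparison, the same aggregation can probably be done in a single awk pass, without temporary files. This is only a sketch under the same assumptions (two numeric columns, identical row counts, no empty files); the function name aggregate_stats is hypothetical and not part of DataAgg.sh.

```shell
#!/bin/bash
# aggregate_stats: hypothetical helper, a one-pass alternative to DataAgg.sh.
# Arguments: the stat files to combine. Output: mean and standard error per row.
aggregate_stats() {
    awk '
    FNR == 1 { n++ }                  # FNR resets to 1 at the start of each file
    {
        sum[FNR]  += $1               # running sum of the property in this row
        err2[FNR] += $2 * $2          # running sum of squared standard errors
        if (FNR > rows) rows = FNR    # remember the largest row number seen
    }
    END {
        for (r = 1; r <= rows; r++)
            printf("%6.2f\t%6.2f\n", sum[r] / n, sqrt(err2[r]) / n)
    }' "$@"
}

# Usage: aggregate_stats stat* > TotalProp
```

Because awk keeps the per-row sums in arrays indexed by FNR, the number of files and the number of rows still do not need to be known in advance.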
