Monday, 23 November 2009

Twittering OCR

'things that go bump in the tunnel'

As I've been following the LHC restart, I've written a parser for the vistar status feed that sends it to Twitter. The basic method is:

Grab the URL (see image), then do some ImageMagick hackery to cut out the corner. Resize it larger (which helps with the OCR) and save it as a TIFF. Run the image through the OCR software and compare the output to the last run; if it is different, upload it to Twitter.


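curl -o "$IMG" "$SRC"
# Cut out the status corner, enlarge it (helps the OCR) and threshold it
convert "$IMG" +repage -crop 509x205+1+533 -resize 1000x -threshold 39000 "$IMG"
convert "$IMG" -monochrome "$TIFF"
mv "$OUT.txt" "$OUT.old" # make a backup of old
tesseract "$TIFF" "$OUT" # tesseract writes its text to $OUT.txt

# Strip out ready for Twitter
DATE=`date +%d-%m-%Y`
sed -i "s/Comments $DATE /#LHC Status /" "$OUT.txt"

diff -q "$OUT.txt" "$OUT.old"
if [ $? -eq 1 ] ; then
# Post to Twitter. $API is a placeholder for the status update URL, which is not given here.
curl --basic --user lhcstatus:password --data-urlencode status="`cat $OUT.txt`" "$API"
fi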
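
The "compare the output to the last run" step can also be wrapped in a small function. This is only a sketch: status_changed is a hypothetical helper name, not part of the script above, and it assumes that a missing previous file (the very first run) should count as a change. It swaps diff -q for cmp -s, which prints nothing and reports the result purely through its exit status.

```shell
#!/bin/sh
# Sketch only: status_changed is a hypothetical helper, not from the original script.

# Succeeds (exit 0) when the new OCR text should be posted:
# either there is no previous run to compare against, or the text differs.
status_changed() {
    new=$1
    old=$2
    [ -f "$old" ] || return 0   # assumption: on the first run, treat the text as changed
    ! cmp -s "$new" "$old"      # cmp -s exits 0 when the two files are identical
}
```

It would then guard the posting step as status_changed "$OUT.txt" "$OUT.old" && curl ..., so nothing is sent when the OCR text is unchanged.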

and lo:

Plotting Lustre MDS stats

At $dayjob we have several large filesystems - for example our /scratch system has 3.1 PB of space using over 1000 HDDs. Although each vendo...