Monday, 23 November 2009

Twittering OCR


'things that go bump in the tunnel'

As I've been following the LHC restart, I've written a parser for the vistar status feed that sends it to Twitter. The basic method is:

Grab the image from the URL (see image), then do some ImageMagick hackery to cut out the corner. Resize it larger (this helps with the OCR) and save it as a TIFF. Run the image through OCR software and compare the output to the last run; if it differs, post it to Twitter.

i.e.

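# Assumes SRC, IMG, TIFF and OUT are set earlier in the script.
curl -o "$IMG" "$SRC"
# Cut out the comments corner, enlarge it and threshold it to help the OCR.
convert "$IMG" +repage -crop 509x205+1+533 -resize 1000x -threshold 39000 "$IMG"
convert "$IMG" -monochrome "$TIFF"
# Make a backup of the old output (there is none on the very first run).
[ -f "$OUT.txt" ] && mv "$OUT.txt" "$OUT.old"
tesseract "$TIFF" "$OUT"   # writes $OUT.txt

# Strip out ready for Twitter
DATE=$(date +%d-%m-%Y)
sed -i "s/Comments $DATE /#LHC Status /" "$OUT.txt"

# diff exits with 1 only when both files exist and differ.
diff -q "$OUT.txt" "$OUT.old" > /dev/null
if [ $? -eq 1 ] ; then
    # Post to Twitter, URL-encoding the status text.
    curl --basic --user lhcstatus:password --data-urlencode "status=$(cat "$OUT.txt")" http://twitter.com/statuses/update.json
fi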

and lo: http://twitter.com/lhcstatus

1 comment:

Elwell said...

Minor update to the above: it now does a wc -m of the new message. If it's over 140 chars, it trims some stuff out.
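
The comment doesn't say what gets trimmed, so the following is only a rough sketch of how such a length check might look. The function name trim_status, the MAX variable and the trimming strategy (squeeze runs of whitespace into single spaces, then cut at 140 characters) are all assumptions for illustration, not the script's actual logic.

```shell
# Hypothetical helper: make a status message fit Twitter's 140-character limit.
MAX=140

trim_status() {
    msg=$1
    # Count characters with wc -m; strip the padding some wc versions add.
    len=$(printf '%s' "$msg" | wc -m | tr -d ' ')
    if [ "$len" -gt "$MAX" ]; then
        # Assumed strategy: squeeze whitespace (newlines included) down to
        # single spaces, then keep only the first $MAX characters.
        # Note: some cut implementations count bytes rather than characters,
        # which is fine for plain ASCII OCR output.
        msg=$(printf '%s' "$msg" | tr -s '[:space:]' ' ' | cut -c1-"$MAX")
    fi
    printf '%s\n' "$msg"
}
```

It could then wrap the text before posting, e.g. trim_status "$(cat "$OUT.txt")". Messages already within the limit pass through unchanged.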