Friday, 23 July 2010

more data musings

(the advantage of traveling by public transport once in a while is you can sit and faff on laptop)

The only guarantee about user-entered data is that, given enough entries, it'll be inconsistent :-(

Take, for example, an OpenStreetMap XAPI query to pull out '/api/0.6/*[amenity=post_box]'

which is a nice dataset of ~85k entries that I'll use for some simple analysis.

So, the UK has ~40k postboxes. According to draco, the entries in that count break down by source as follows:
13.5k - osm, 26.7k - website.

So, of those 13504 UK postboxes in OSM, how many are run by Royal Mail? (hint - most of them!)
Does the data match?

$ grep "operator" ~/Downloads/data.osm | sort | uniq -c | grep -i royal
1 <tag k='operator' v='Post Office: Royal Mail'/>
1 <tag k='operator' v='royal mail'/>
1 <tag k='operator' v='Royal mail'/>
5065 <tag k='operator' v='Royal Mail'/>
1 <tag k='operator' v='RoyalMail'/>
1 <tag k='operator' v='Royal MAil'/>
1 <tag k='operator' v='Royal Mail Warwick'/>
2 <tag k='operator' v='Royal York'/>

not bad - only a few CaSe sEnsiTive issues to sort out
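
One way to sort out the case issues is to fold everything to lower case before counting. This is only a rough sketch: `fold_operators` is a made-up helper name, and it assumes the tags are single-quoted on one line each, as in the grep output above. It won't merge things like 'RoyalMail' or 'Royal Mail Warwick', and `tr` may only fold ASCII letters depending on your locale.

```shell
# Hypothetical helper: count operator values after folding case.
# Assumes tags look like <tag k='operator' v='...'/> (single quotes, one per line).
fold_operators() {
    # Pull out just the value of each operator tag.
    sed -n "s/.*<tag k='operator' v='\([^']*\)'.*/\1/p" "$1" |
        # Fold to lower case so 'Royal mail' and 'Royal MAil' collapse together.
        tr '[:upper:]' '[:lower:]' |
        # Count each distinct value, most common first.
        sort | uniq -c | sort -rn
}
```

Something like `fold_operators ~/Downloads/data.osm | grep royal` should then show the case variants as a single line.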

What about other operators, say La Poste?

$ grep "operator" ~/Downloads/data.osm | sort | uniq -c | grep -i poste
1 <tag k='operator' v='Bureau de poste'/>
1 <tag k='operator' v='De Post - La Poste'/>
7 <tag k='operator' v='la poste'/>
21 <tag k='operator' v='la Poste'/>
12 <tag k='operator' v='La poste'/>
917 <tag k='operator' v='La Poste'/>
1 <tag k='operator' v='La Poste Belgique'/>
6 <tag k='operator' v='La Poste - De Post'/>
1 <tag k='operator' v='La Poste Suisse'/>
1 <tag k='operator' v='Le Poste'/>
1 <tag k='operator' v='poste'/>
5 <tag k='operator' v='Poste'/>

again - it's the 'long tail' problem. So, out of the ~85k entries how many unique operators?
404 (how apt for a web service)

and of those how many are singles? 222 - OVER HALF!
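
For anyone wanting to reproduce that sort of count, here is a minimal sketch of how the unique and single-use tallies could be pulled out in one go. `operator_tally` is a hypothetical name, and it makes the same assumption about single-quoted, one-per-line tags; no case folding is done, so 'la poste' and 'La Poste' count as different operators.

```shell
# Hypothetical helper: how many distinct operator values, and how many appear only once.
# Assumes tags look like <tag k='operator' v='...'/> (single quotes, one per line).
operator_tally() {
    sed -n "s/.*<tag k='operator' v='\([^']*\)'.*/\1/p" "$1" |
        sort | uniq -c |
        # Each input line is "<count> <value>"; $1 is the count.
        awk '{ total++; if ($1 == 1) singles++ }
             END { printf "%d unique, %d singles\n", total, singles }'
}
```

Usage would be along the lines of `operator_tally ~/Downloads/data.osm`.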

Wednesday, 21 July 2010

m m m metadata!

OK, in a semantic web kinda way, I've been looking at some of the clever machine tag integration that flickr are doing, and thinking about how these things *should* automatically link up.

Take for example and look at

There are many excellent postbox groups already but they all have, to my mind one problem - no structure enabling anyone to find anything. This group will only contain boxes which have their postcode in the title or tags, enabling easy searching.

if you can't find it, or reference it, it's useless.

I therefore propose to tag the postbox pics with 'ukpostbox:XXX_YYY', where XXX is the 1st part of the postcode and YYY is the box ID.

This means that 1) things like locating-postboxes could bring up a set of pics of the boxes, and 2) flickr could automatically link to posting times (uhm, but that assumes the Royal Mail has an API, ha ha ha ha ha). If people have followed the tagging guidelines on OpenStreetMap, then you can even link them directly together.
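
To make the tag format concrete, here is a small sketch that builds one from a full postcode and a box ID. `ukpostbox_tag` is a made-up helper, the postcode and box ID in the usage line are invented for illustration, and taking "everything before the first space" as the 1st part of the postcode is my assumption.

```shell
# Hypothetical helper: build a 'ukpostbox:XXX_YYY' tag.
#   $1 = full postcode (1st part is assumed to be everything before the first space)
#   $2 = box ID
ukpostbox_tag() {
    postcode=$1
    box_id=$2
    # Upper-case the postcode and keep only the part before the first space.
    outward=$(printf '%s' "$postcode" | tr '[:lower:]' '[:upper:]' | cut -d' ' -f1)
    printf 'ukpostbox:%s_%s\n' "$outward" "$box_id"
}
```

So `ukpostbox_tag 'cb1 2ab' 123` would give a tag starting with `ukpostbox:CB1_`.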

Maybe Sir Tim's vision wasn't so bad after all :-)

Wednesday, 14 July 2010


Here at CERN we use AFS for our home directories on Linux, but the Windows stuff all uses DFS with https WebDAV voodoo.

Discovered on Ubuntu 10.04 that if I save a nautilus bookmark with:

davs:// dfshome

it totally fails to connect. The correct syntax is simply:

davs:// dfshome

and put in *both* your username and password when prompted.
Oh, and they're stored in ~/.gtk-bookmarks in case you need to edit them :-)

Plotting Lustre MDS stats

At $dayjob we have several large filesystems - for example our /scratch system has 3.1 PB of space using over 1000 HDDs. Although each vendo...