Friday, 23 July 2010

more data musings

(the advantage of traveling by public transport once in a while is that you can sit and faff on a laptop)



The only guarantee about user-entered data is that, given enough entries, it'll be inconsistent :-(

Take, for example, an openstreetmap xapi query to pull out '/api/0.6/*[amenity=post_box]'

which gives a nice dataset of ~85k entries that I'll use for some simple analysis.

So, the UK has ~40k postboxes; according to draco, those entries are sourced as follows:
13.5k - osm, 26.7k - website.

So, of those 13504 UK postboxes in OSM, how many are run by Royal Mail? (Hint: most of them!)
Does the data match?

$ grep "operator" ~/Downloads/data.osm | sort | uniq -c | grep -i royal
1 <tag k='operator' v='Post Office: Royal Mail'/>
1 <tag k='operator' v='royal mail'/>
1 <tag k='operator' v='Royal mail'/>
5065 <tag k='operator' v='Royal Mail'/>
1 <tag k='operator' v='RoyalMail'/>
1 <tag k='operator' v='Royal MAil'/>
1 <tag k='operator' v='Royal Mail Warwick'/>
2 <tag k='operator' v='Royal York'/>

not bad - only a few CaSe sEnsiTive issues to sort out
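
One way to fold those case and spacing variants together is to normalise the value before counting. This is only a rough sketch: the function name operator_counts_ci is made up for illustration, and it assumes the file has one single-quoted <tag .../> per line, as in the grep output above.

```shell
# Hypothetical helper: count operator values, ignoring case and spaces.
# Assumes one <tag k='operator' v='...'/> per line, single-quoted.
operator_counts_ci() {
    # Pull out just the v='...' value of each operator tag
    sed -n "s/.*k='operator' v='\([^']*\)'.*/\1/p" "$1" \
        | tr '[:upper:]' '[:lower:]' \
        | tr -d ' ' \
        | sort | uniq -c | sort -rn
}
```

Something like `operator_counts_ci ~/Downloads/data.osm | grep royal` should then report 'Royal Mail', 'royal mail', 'RoyalMail' and 'Royal MAil' as a single row. Entries such as 'Royal Mail Warwick' would still need sorting out by hand.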

What about other operators, say La Poste?

$ grep "operator" ~/Downloads/data.osm | sort | uniq -c | grep -i poste
1 <tag k='operator' v='Bureau de poste'/>
1 <tag k='operator' v='De Post - La Poste'/>
7 <tag k='operator' v='la poste'/>
21 <tag k='operator' v='la Poste'/>
12 <tag k='operator' v='La poste'/>
917 <tag k='operator' v='La Poste'/>
1 <tag k='operator' v='La Poste Belgique'/>
6 <tag k='operator' v='La Poste - De Post'/>
1 <tag k='operator' v='La Poste Suisse'/>
1 <tag k='operator' v='Le Poste'/>
1 <tag k='operator' v='poste'/>
5 <tag k='operator' v='Poste'/>

Again, it's the 'long tail' problem. So, out of the ~85k entries, how many unique operators are there?
404 (how apt for a web service)

And of those, how many appear only once? 222 - OVER HALF!
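
For anyone wanting to repeat the count, here is a minimal sketch of one way to get both numbers in a single pass. It is not necessarily how the figures above were produced: the function name operator_tail_stats is invented for illustration, and it again assumes one single-quoted operator tag per line.

```shell
# Hypothetical helper: number of distinct operator values, and how many
# of those occur exactly once (the 'long tail').
operator_tail_stats() {
    sed -n "s/.*k='operator' v='\([^']*\)'.*/\1/p" "$1" \
        | sort | uniq -c \
        | awk '{ total++; if ($1 == 1) singles++ }
               END { printf "unique=%d singles=%d\n", total, singles }'
}
```

Called as `operator_tail_stats ~/Downloads/data.osm`, it prints one line of the form unique=N singles=M. Values are compared exactly here, so 'La Poste' and 'la poste' count as two different operators, which is the whole problem.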

1 comment:

qu1j0t3 said...

What, nobody put in "Consignia" as a joke? :)
