At Oslandia, we like working with Open Source tool projects and handling Open (geospatial) Data. In this article series, we play with OpenStreetMap (OSM) and its data. We saw in the previous post how to parse OSM elements and integrate them into a data-oriented structure. This fourth article of the series is dedicated to the parsing and analysis of the OSM tag genome, i.e. the tag keys and values that are used within the OSM API.
1 OSM tag parsing
What kind of tags do we have to characterize OSM objects? There are tag keys on the one hand and tag values on the other. It can be interesting to describe both sets.
1.1 Definition of a specified handler
Following the parsing process of the previous article, we can build a small class dedicated to tag information parsing. This class is defined as follows:
    import osmium as osm
    import pandas as pd

    class TagGenomeHandler(osm.SimpleHandler):
        def __init__(self):
            osm.SimpleHandler.__init__(self)
            self.taggenome = []

        def tag_inventory(self, elem, elem_type):
            for tag in elem.tags:
                self.taggenome.append([elem_type,
                                       elem.id,
                                       elem.version,
                                       tag.k,
                                       tag.v])

        def node(self, n):
            self.tag_inventory(n, "node")

        def way(self, w):
            self.tag_inventory(w, "way")

        def relation(self, r):
            self.tag_inventory(r, "relation")
We introduce here the differentiation between OSM elements (node, way, relation): we see that it is fairly straightforward to parse tags for each element type.
This version of the tag genome does not contain every version of the element history: it only contains the versions for which elements are tagged. If needed, a simple merge with the complete history can do the job (see the next section).
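To make the shape of the handler output concrete, here is a minimal sketch of what tag_inventory accumulates. It does not use pyosmium: Tag and Element are hypothetical stand-ins that only model the attributes read by the handler (id, version, and tags with k and v), and all values are made up.

```python
from collections import namedtuple

# Hypothetical stand-ins for pyosmium objects: only the attributes
# read by tag_inventory are modelled here.
Tag = namedtuple("Tag", ["k", "v"])
Element = namedtuple("Element", ["id", "version", "tags"])


def tag_inventory(taggenome, elem, elem_type):
    """Append one row per tag, mirroring TagGenomeHandler.tag_inventory."""
    for tag in elem.tags:
        taggenome.append([elem_type, elem.id, elem.version, tag.k, tag.v])


rows = []
# A way version carrying two tags yields two rows
way = Element(42, 1, [Tag("highway", "residential"), Tag("name", "Rue A")])
tag_inventory(rows, way, "way")
# An untagged node version yields no row at all
tag_inventory(rows, Element(7, 3, []), "node")

for row in rows:
    print(row)
```

The untagged node leaves no trace in the list, which is why the genome only covers tagged versions.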
1.2 Description of the tag genome in some examples
What we call a tag genome is actually a catalog of every tag associated with OSM objects, at each version. By applying the previous handler class to the Bordeaux data, and by sampling the resulting genome, we can get examples of tags:
    taghandler = TagGenomeHandler()
    taghandler.apply_file("../src/data/raw/bordeaux-metropole.osh.pbf")

    colnames = ['type', 'id', 'version', 'tagkey', 'tagvalue']
    tag_genome = pd.DataFrame(taghandler.taggenome, columns=colnames)
    tag_genome.sample(10)

             type      id          version  tagkey             tagvalue
    2101454  way       274860212   6        alt_name           Côte de la Vieille Cure
    14410    node      118378999   1        tiger:upload_uuid  bulk_upload.pl-6e90...
    486325   node      2534869490  2        highway            stop
    1807400  way       173355414   1        building           yes
    872198   way       42071701    1        wood               mixed
    940593   way       77955619    1        wall               no
    1386900  way       154219551   1        source             cadastre-dgi-fr sou...
    2175698  relation  8637        182      boundary           administrative
    1382909  way       154217090   1        building           yes
    1631596  way       160788134   1        building           yes
This sample shows that various kinds of tags exist; they characterize roads, buildings and so on. Let us consider a specific node, for instance the node with ID 21457144:
    tag_genome.query("id == 21457144")

       type  id        version  tagkey      tagvalue
    1  node  21457144  8        created_by  Potlatch 0.6b
We can see that only one version of this element is tagged, and that it carries a single tag. This tag gives information on the editing tool used by the contributor. By enriching the tag genome with the full OSM history, we can verify that the node is untagged in the previous (and next) versions:
    osm_history = pd.read_csv("../src/data/output-extracts/bordeaux-metropole/bordeaux-metropole-elements.csv")
    enhanced_tag_genome = pd.merge(osm_history[['elem', 'id', 'version']],
                                   tag_genome,
                                   how='left',
                                   left_on=['elem', 'id', 'version'],
                                   right_on=['type', 'id', 'version'])
    enhanced_tag_genome.query("id==21457144")

        elem  id        version  type  tagkey      tagvalue
    47  node  21457144  2        NaN   NaN         NaN
    48  node  21457144  3        NaN   NaN         NaN
    49  node  21457144  4        NaN   NaN         NaN
    50  node  21457144  5        NaN   NaN         NaN
    51  node  21457144  6        NaN   NaN         NaN
    52  node  21457144  7        NaN   NaN         NaN
    53  node  21457144  8        node  created_by  Potlatch 0.6b
    54  node  21457144  9        NaN   NaN         NaN
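The behaviour of this left merge can be reproduced on a toy example. The sketch below is only illustrative: history and tags are made-up frames that borrow the column names of osm_history and tag_genome, with a single node tagged at version 2 only.

```python
import pandas as pd

# Toy history: three versions of one node (made-up values)
history = pd.DataFrame({"elem": ["node"] * 3,
                        "id": [1, 1, 1],
                        "version": [1, 2, 3]})
# Toy tag genome: only version 2 of the node is tagged
tags = pd.DataFrame({"type": ["node"],
                     "id": [1],
                     "version": [2],
                     "tagkey": ["created_by"],
                     "tagvalue": ["Potlatch 0.6b"]})

# A left merge keeps every history row; versions without a match get NaN
merged = pd.merge(history, tags, how="left",
                  left_on=["elem", "id", "version"],
                  right_on=["type", "id", "version"])
untagged = merged[merged["tagkey"].isna()]

print(merged)
print(untagged["version"].tolist())
```

Filtering on a missing tagkey is then enough to list the versions in which the element carries no tag.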
2 Analysis of the global tag genome
To go further and understand how OSM objects are tagged, we can provide a short statistical description of the tag genome, for the area of Bordeaux.
By focusing on a simple description of the tags, we can identify some interesting points:
- the number of tag keys is larger for nodes and ways, and smaller for relations:
    tag_genome.groupby('type')['tagkey'].nunique()

    type
    node        647
    relation    320
    way         545
    Name: tagkey, dtype: int64
- the most frequent keys are source, building and highway; they are not uniformly distributed with respect to the three OSM types:
    tagkeycount = (tag_genome.groupby(['tagkey','type'])['type']
                   .count()
                   .unstack()
                   .fillna(0))
    tagkeycount['total'] = tagkeycount.apply(sum, axis=1)
    tagkeycount = tagkeycount.sort_values('total', ascending=False)
    tagkeycount.head()

    type          node  relation       way     total
    tagkey
    source    152101.0    5613.0  461284.0  618998.0
    building    2958.0     287.0  446139.0  449384.0
    highway    23727.0      14.0  115576.0  139317.0
    wall           0.0      22.0  124438.0  124460.0
    name       18512.0   18341.0   67794.0  104647.0
- complex elements such as relations tend to carry more tags than ways, which in turn carry more tags than nodes, if we consider the number of tags divided by the number of element versions:
    tag_genome.groupby(['type'])['version'].count() / osm_history.groupby(['elem'])['version'].count()

    type
    node        0.229626
    relation    6.810917
    way         2.437369
    Name: version, dtype: float64
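The last ratio of this list can be reproduced on a toy example. The sketch below is only illustrative: both frames contain made-up values and simply reuse the column names of tag_genome and osm_history.

```python
import pandas as pd

# Toy tag genome: one row per (element version, tag)
tags = pd.DataFrame({
    "type": ["node", "way", "way", "way", "relation", "relation"],
    "id": [1, 10, 10, 11, 100, 100],
    "version": [1, 1, 1, 1, 1, 1],
    "tagkey": ["highway", "source", "building", "source", "type", "name"],
})
# Toy history: one row per element version, tagged or not
history = pd.DataFrame({
    "elem": ["node", "node", "node", "node", "way", "way", "relation"],
    "id": [1, 2, 3, 4, 10, 11, 100],
    "version": [1, 1, 1, 1, 1, 1, 1],
})

tags_per_type = tags.groupby("type")["version"].count()
versions_per_type = history.groupby("elem")["version"].count()
# The division aligns both Series on their index labels (node, relation, way)
ratio = tags_per_type / versions_per_type
print(ratio)
```

Three of the four toy nodes are untagged, which pulls the node ratio well below 1, as in the Bordeaux figures.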
3 Analysis of the tag key/value frequency
How do object tags evolve over time, and more specifically across object versions? By designing a few functions that focus on OSM element versions, we can get a useful overview of this aspect.
3.1 Tag key frequency
First we build a small function which counts the unique elements associated with each tag key.
    def tagkey_analysis(genome, pivot_var=['type']):
        return (genome.groupby(['tagkey', *pivot_var])['id']
                .nunique()
                .unstack()
                .fillna(0))

    tagkey_overview = tagkey_analysis(enhanced_tag_genome, ['type', 'version'])
    tagkey_overview.sort_values(1, ascending=False).iloc[:5,:5]

    version                        1        2        3       4       5
    tagkey           type
    source           way    355974.0  85095.0  13056.0  2861.0  1315.0
    building         way    350504.0  81612.0  10592.0  1948.0   671.0
    source           node   122482.0  16281.0  10392.0  1541.0   627.0
    wall             way    103435.0  19001.0   1754.0   179.0    47.0
    addr:housenumber node    86566.0   2882.0   1249.0   742.0   402.0
The previous results show that almost 356k ways of version 1 are tagged with the key source. This information becomes even more interesting when we compare it with the total number of first-version ways.
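    def total_elem(genome, pivot_var=['type', 'version']):
        # Rows with a missing 'type' (untagged versions in the enhanced
        # genome) are skipped by groupby, so these totals cover the
        # element versions that carry at least one tag
        return genome.groupby(pivot_var)['id'].nunique().unstack().fillna(0)

    total_elem(enhanced_tag_genome).iloc[:,:5]

    version          1         2        3        4       5
    type
    node      151184.0   28366.0  15524.0   4292.0  2281.0
    relation    5307.0    2546.0   1125.0    654.0   504.0
    way       402413.0  109575.0  29578.0  14599.0  9964.0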
This last table is the basis for understanding tag popularity. To recall our previous example, there are more than 402k ways with version equal to 1, which means that the tag key source appears in around 88% of such cases.
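As a quick check of this arithmetic, the share can be recomputed by hand from the two tables above. The sketch below only copies the published counts for ways at versions 1, 3 and 5; the dictionary names are ours.

```python
# Counts read from the two tables above (ways only)
source_ways = {1: 355974, 3: 13056, 5: 1315}   # ways tagged with "source"
total_ways = {1: 402413, 3: 29578, 5: 9964}    # totals per version

# Percentage of ways carrying the "source" key, for each version
share = {version: round(100 * source_ways[version] / total_ways[version], 2)
         for version in source_ways}
print(share)
# {1: 88.46, 3: 44.14, 5: 13.2}
```

The share drops quickly with the version number, which already hints at the trend discussed below.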
Such a result can be generalized to all (tag key, element type) pairs with the following Python procedure:
    def tag_frequency(genome, pivot_var=['type', 'version']):
        total_uniqelem = total_elem(genome, pivot_var)
        tagcount = tagkey_analysis(genome, pivot_var)
        # Prepare data: group tag counts by element types
        tagcount_groups = tagcount.groupby(level='type')
        # For each type, compute the proportion of element tagged with each tag
        tag_freq = []
        for key, group in tagcount_groups:
            tag_freq.append(group / total_uniqelem.loc[key])
        # Regroup in one single dataframe and return
        tag_freq = pd.concat(tag_freq)
        return 100*tag_freq.round(4)

    tag_frequency(enhanced_tag_genome, ['type','version']).sort_values(1, ascending=False).head(20)[[1,3,5,10,15]]

    version                            1      3      5     10     15
    tagkey               type
    type                 relation  97.32  97.07  97.42  98.57  99.00
    source               way       88.46  44.14  13.20   7.27   5.65
    building             way       87.10  35.81   6.73   1.31   0.22
    source               node      81.02  66.94  27.49   9.52   1.27
    name                 relation  70.40  88.00  89.88  91.07  91.04
    addr:housenumber     node      57.26   8.05  17.62   0.28   0.00
    source               relation  51.86  36.62  19.64  10.71   9.45
    ref:FR:FANTOIR       relation  48.82  32.00   9.72   2.50   1.49
    wall                 way       25.70   5.93   0.47   0.00   0.00
    natural              node      18.53  40.05   0.26   0.00   0.00
    start_date           node      17.32  39.99   0.75   0.56   0.00
    ref:FR:bordeaux:tree node      17.31  40.02   0.26   0.00   0.00
    circumfere           node      17.31  40.02   0.26   0.00   0.00
    height               node      17.31  39.96   0.26   0.00   0.00
    species              node      16.93  40.02   0.26   0.00   0.00
    restriction          relation  11.31   3.64   1.19   0.00   0.00
    note:import-bati     way       11.05   0.18   0.01   0.00   0.00
    highway              way        7.97  49.69  78.75  82.28  80.22
                         node       7.30  15.45  37.88  43.14  37.97
    public_transport     relation   5.18   4.71   2.18   0.71   0.00
As a result, we can see some salient points in this tag genome, which give insight into how OSM contributors build the API objects.
For instance, source tags are intensively used in the first version of objects, but the coverage decreases when the objects are updated. The same pattern applies to ways tagged as building. Conversely, it is common to add the name tag after a few updates. The highway tag (for ways, no surprise) follows the same increasing trend version after version.
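The frequency computation of this section can also be sketched in long format, with a merge instead of a loop over element types. This is an alternative formulation, not the procedure used in the article, and the toy genome below contains made-up values.

```python
import pandas as pd

# Toy tag genome (made-up values), same column names as tag_genome
genome = pd.DataFrame({
    "type": ["way", "way", "way", "way", "node", "node"],
    "id": [1, 1, 2, 3, 10, 11],
    "version": [1, 1, 1, 1, 1, 1],
    "tagkey": ["source", "building", "source", "highway", "source", "highway"],
})

# Unique elements per (tag key, type, version) ...
tagged = (genome.groupby(["tagkey", "type", "version"])["id"]
          .nunique().rename("tagged").reset_index())
# ... and unique tagged elements per (type, version)
total = (genome.groupby(["type", "version"])["id"]
         .nunique().rename("total").reset_index())

# Attach the matching total to each tag count, then compute the percentage
freq = tagged.merge(total, on=["type", "version"])
freq["pct"] = (100 * freq["tagged"] / freq["total"]).round(2)
print(freq.sort_values("pct", ascending=False))
```

In this toy case, two of the three ways carry a source tag, hence a frequency of 66.67% for the (source, way) pair.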
3.2 Tag value frequency
As previously with tag keys, we can measure the popularity of tag values. It would not be very meaningful to mix up all tag keys and compare tag values as varied as those associated e.g. with buildings or parks, so we only study a single reference tag key. For instance, we can focus on road data and evaluate how many highway tags are available on the API.
We get similar Python procedures, which take into account the tag values associated with a given tag key.
    def tagvalue_analysis(genome, key, pivot_var=['type']):
        return (genome.query("tagkey==@key")
                .groupby(['tagvalue', *pivot_var])['id']
                .nunique()
                .unstack()
                .fillna(0))

    tagvalue_overview = tagvalue_analysis(tag_genome, 'highway', ['type', 'version'])
    tagvalue_overview.sort_values(1, ascending=False).iloc[:5,:7]

    version                  1       2       3       4       5       6       7
    tagvalue    type
    residential way    10971.0  9458.0  7201.0  5286.0  3795.0  2725.0  1999.0
    service     way     7069.0  2777.0  1409.0   778.0   449.0   292.0   195.0
    crossing    node    6338.0  2583.0  1022.0   434.0   205.0   107.0    59.0
    footway     way     3797.0  1841.0   782.0   417.0   245.0   146.0    89.0
    bus_stop    node    2742.0  2182.0   447.0   179.0    71.0    37.0    11.0
Here we see that the most frequent highway tag value when elements are created is residential.
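The filtering step of tagvalue_analysis relies on the "@" prefix of DataFrame.query, which refers to a Python variable rather than a column. Here is a minimal sketch of that mechanism on made-up data; count_values is a hypothetical helper, not a function of the article.

```python
import pandas as pd

# Toy tag genome (made-up values)
tags = pd.DataFrame({
    "type": ["way", "way", "node", "way"],
    "id": [1, 2, 3, 4],
    "tagkey": ["highway", "building", "highway", "highway"],
    "tagvalue": ["residential", "yes", "crossing", "residential"],
})


def count_values(genome, key):
    # "@key" is resolved by pandas to the local Python variable `key`,
    # whereas a bare name such as "tagkey" refers to a column
    return (genome.query("tagkey == @key")
            .groupby("tagvalue")["id"]
            .nunique())


highway_values = count_values(tags, "highway")
print(highway_values)
```

Only the rows whose key is highway survive the query, so the building value never appears in the result.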
These figures can be compared with the total number of elements tagged as highway, for each element type and version:
    def tot_values(genome, key, pivot_var=['type', 'version']):
        return (genome.query("tagkey==@key")
                .groupby(pivot_var)['id']
                .nunique()
                .unstack()
                .fillna(0))

    tot_values(tag_genome, 'highway')[[1,2,3,4,5,10,15]]

    version         1        2        3        4       5      10     15
    type
    node      11038.0   6055.0   2398.0   1319.0   864.0   154.0   30.0
    relation      7.0      3.0      1.0      0.0     0.0     0.0    0.0
    way       32080.0  21065.0  14697.0  10632.0  7847.0  2140.0  738.0
That’s not so surprising: a large majority of highway elements are nodes or ways. The proportion of each tag value is computed with the following procedure:
    def tagvalue_frequency(genome, key, pivot_var=['type', 'version']):
        total_uniqelem = tot_values(genome, key, pivot_var)
        # Pass pivot_var through instead of hard-coding it
        tagcount = tagvalue_analysis(genome, key, pivot_var)
        tagcount_groups = tagcount.groupby(level='type')
        tag_freq = []
        # elem_type avoids shadowing the `key` argument
        for elem_type, group in tagcount_groups:
            tag_freq.append(group / total_uniqelem.loc[elem_type])
        tag_freq = pd.concat(tag_freq)
        return (100*tag_freq).round(4)

    tagvalue_freq = (tagvalue_frequency(tag_genome, 'highway', ['type','version'])
                     .swaplevel()
                     .sort_values(1, ascending=False))
Contrary to the tag key analysis, we can’t expect a frequency close to 100% for every tag value: an element version carries only one value for a given key, so the values share the 100% between them. For the sake of clarity, we present the results separately for each element type:
- the least used type: the relation
    tagvalue_freq.loc['relation', [1,3,5,10,15]]

    version             1      3   5  10  15
    tagvalue
    pedestrian    57.1429  100.0 NaN NaN NaN
    raceway       14.2857    0.0 NaN NaN NaN
    service       14.2857    0.0 NaN NaN NaN
    unclassified  14.2857    0.0 NaN NaN NaN
    motorway       0.0000    0.0 NaN NaN NaN
There are only 7 first-version relations that are highway-related, and 4 of them are tagged with the value pedestrian. Only one of these relations has a third version. There is no highway-related relation with a higher version number.
- the intermediate type: the node
    tagvalue_freq.loc['node', [1,3,5,10,15]].head(10)

    version                  1        3        5       10       15
    tagvalue
    crossing           57.4198  42.6188  23.7269   9.7403   6.6667
    bus_stop           24.8415  18.6405   8.2176   0.6494   0.0000
    street_lamp         5.3180   0.0000   0.0000   0.0000   0.0000
    traffic_signals     5.1912  25.6047  54.6296  68.8312  63.3333
    turning_circle      2.9353   6.3803   2.1991   0.0000   3.3333
    give_way            2.0112   0.2085   0.1157   0.0000   0.0000
    stop                0.8607   0.2919   0.0000   0.0000   0.0000
    mini_roundabout     0.5164   2.1268   0.9259   0.0000   0.0000
    motorway_junction   0.3533   3.3778   8.9120  20.1299  26.6667
    speed_camera        0.1721   0.1251   0.1157   0.0000   0.0000
When OSM contributors tag a new node as highway-related, in most cases the chosen value is crossing. We also have a large number of bus_stop values. The nodes tagged as traffic_signals or motorway_junction tend to reach higher versions.
We don’t claim here that these two values are the final labels of highway nodes: the previous table does not accumulate elements across versions, it shows a snapshot of each version taken separately. However, an interpretation is still possible: we can consider that contributors need more edits to reach a consensus on such nodes.
- the most natural type: the way
    tagvalue_freq.loc['way', [1,3,5,10,15]].head(10)

    version             1        3        5       10       15
    tagvalue
    residential   34.1989  48.9964  48.3624  36.3084  26.6938
    service       22.0355   9.5870   5.7219   2.9907   1.7615
    footway       11.8360   5.3208   3.1222   1.3551   0.2710
    unclassified   6.0661   7.8179   7.6207   5.9346   4.3360
    tertiary       4.9314   7.4913  10.6665  18.0374  25.0678
    path           4.1397   1.8099   1.3126   0.2336   0.0000
    cycleway       3.8996   3.3068   3.1350   2.8037   2.4390
    secondary      3.4819   4.9806   6.7669  11.2150  15.0407
    primary        1.8267   2.8033   3.4663   5.6075   9.0786
    track          1.3217   0.5511   0.2804   0.0467   0.1355
As for relations and nodes, the distribution of tag values for each way version gives some information on the way OSM contributors enrich the API. A third of newly created highway-related ways are tagged as residential. The proportion of such ways remains relatively high version after version: they are intensively updated by contributors!
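As a sanity check on these three tables, the frequencies of the values of a single key should sum to 100% within one type and version, since an element version holds one value per key. The sketch below rebuilds the first-version relation case from its proportions (4 pedestrian relations out of 7); the ids are made up.

```python
import pandas as pd

# Toy data mimicking the 7 first-version highway relations (ids are made up)
relations = pd.DataFrame({
    "id": range(1, 8),
    "tagvalue": ["pedestrian"] * 4 + ["raceway", "service", "unclassified"],
})

# Unique elements per value, then share of each value in percent
counts = relations.groupby("tagvalue")["id"].nunique()
freq = (100 * counts / counts.sum()).round(4)

print(freq)
print(freq.sum())
```

The pedestrian share comes out at 57.1429%, as in the relation table, and the four shares add up to 100% up to rounding.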
As a last remark, we can compare this tag value distribution with the global highway tag distribution: the Bordeaux area seems to have a larger share of footway, secondary and tertiary highways, and a smaller share of track tags. Is that enough to say this area is urban, without any prior knowledge of the sub-region?
4 Conclusion
The analysis proposed in this article has shown that digging into the OSM tag set is a demanding but fascinating task. A lot of insights are available to whoever is able to let the data do the talking. In this exercise, we have proposed some avenues, but there is still so much more to do!
In the next article, we will close this parenthesis and come back to our first objective: OSM data quality. We will consider metadata extraction as a first step towards quality measurement.