OpenStreetMap tag genome: how are OSM objects tagged?

At Oslandia, we like working with Open Source tool projects and handling Open (geospatial) Data. In this article series, we will play with OpenStreetMap (OSM) and the subsequent data. We saw in the previous post how to parse OSM elements and integrate them into a data-oriented structure. Here comes the fourth article of this series, dedicated to the parsing and analysis of OSM tag genome, i.e. tag keys and values that are used within the OSM API.

1 OSM tag parsing

What kind of tags do we have to characterize OSM objects ? There are tag keys on the first hand and tag values on the other hand. It can be interesting to describe both sets.

1.1 Definition of a specified handler

On the model of the previous article parsing process, we can build a small class dedicated to tag information parsing. This class is defined as follows:

import osmium as osm
import pandas as pd

class TagGenomeHandler(osm.SimpleHandler):
    def __init__(self):
        osm.SimpleHandler.__init__(self)
        self.taggenome = []

    def tag_inventory(self, elem, elem_type):
        for tag in elem.tags:
            self.taggenome.append([elem_type, 
                                   elem.id, 
                                   elem.version, 
                                   tag.k, 
                                   tag.v])

    def node(self, n):
        self.tag_inventory(n, "node")

    def way(self, w):
        self.tag_inventory(w, "way")

    def relation(self, r):
        self.tag_inventory(r, "relation")

We introduce here the differentiation between OSM elements (node, way, relation): we see that it is fairly straightforward to parse tags for each element types.

In this version of the tag genome, we do not consider every history element versions. There are only versions for which elements are tagged. A simple merging procedure with the complete history can do the job, if needed (see in the next section).

1.2 Description of the tag genome in some examples

What we call a tag genome is actually a catalog of every tags associated with OSM objects, at each version. By applying the previous handler class to Bordeaux data, and by sampling the obtained genome, we can get exemples of tags:

taghandler = TagGenomeHandler()
taghandler.apply_file("../src/data/raw/bordeaux-metropole.osh.pbf")
colnames = ['type', 'id', 'version', 'tagkey', 'tagvalue']
tag_genome = pd.DataFrame(taghandler.taggenome, columns=colnames)
tag_genome.sample(10)

             type          id  version             tagkey                 tagvalue
2101454       way   274860212        6           alt_name  Côte de la Vieille Cure
14410        node   118378999        1  tiger:upload_uuid   bulk_upload.pl-6e90...
486325       node  2534869490        2            highway                     stop
1807400       way   173355414        1           building                      yes
872198        way    42071701        1               wood                    mixed
940593        way    77955619        1               wall                       no
1386900       way   154219551        1             source   cadastre-dgi-fr sou...
2175698  relation        8637      182           boundary           administrative
1382909       way   154217090        1           building                      yes
1631596       way   160788134        1           building                      yes

This sample shows that various kinds of tags exist; they characterize either roads, buildings and so on… If we consider a specific node, for instance the node characterized by ID n°21457126:

tag_genome.query("id == 21457144")

   type        id  version      tagkey       tagvalue
1  node  21457144        8  created_by  Potlatch 0.6b

We can see that there is only one version for which the element is tagged by only one single tag. This tag gives information on the editing tool used by the contributor. By enriching the tag genome with full OSM history, we can verify that the node is untagged in previous (and next) versions:

osm_history = pd.read_csv("../src/data/output-extracts/bordeaux-metropole/bordeaux-metropole-elements.csv")
enhanced_tag_genome = pd.merge(osm_history[['elem', 'id', 'version']], tag_genome, how='left', left_on=['elem', 'id', 'version'], right_on=['type', 'id', 'version'])
enhanced_tag_genome.query("id==21457144")

    elem        id  version  type      tagkey       tagvalue
47  node  21457144        2   NaN         NaN            NaN
48  node  21457144        3   NaN         NaN            NaN
49  node  21457144        4   NaN         NaN            NaN
50  node  21457144        5   NaN         NaN            NaN
51  node  21457144        6   NaN         NaN            NaN
52  node  21457144        7   NaN         NaN            NaN
53  node  21457144        8  node  created_by  Potlatch 0.6b
54  node  21457144        9   NaN         NaN            NaN

2 Analysis of the global tag genome

To go further and understand how OSM objects are tagged, we can provide a short statistical description of the tag genome, for the area of Bordeaux.

By focusing on simple tag description, we can identify some interesting points:

the number of tag keys is larger for nodes and ways, and smaller for relations:

tag_genome.groupby('type')['tagkey'].nunique()

type
node        647
relation    320
way         545
Name: tagkey, dtype: int64

the most frequent keys are source, building and highway, they are not uniformly distributed with respect to the three OSM types:

tagkeycount = (tag_genome.groupby(['tagkey','type'])['type']
               .count()
               .unstack()
               .fillna(0))
tagkeycount['total'] = tagkeycount.apply(sum, axis=1)
tagkeycount = tagkeycount.sort_values('total', ascending=False)
tagkeycount.head()

type          node  relation       way     total
tagkey                                          
source    152101.0    5613.0  461284.0  618998.0
building    2958.0     287.0  446139.0  449384.0
highway    23727.0      14.0  115576.0  139317.0
wall           0.0      22.0  124438.0  124460.0
name       18512.0   18341.0   67794.0  104647.0

complex elements such as relations tend to be more tagged than ways, which tend to be more tagged than nodes, if we consider the number of tags divided by the number of elements:

tag_genome.groupby(['type'])['version'].count() / osm_history.groupby(['elem'])['version'].count()

type
node        0.229626
relation    6.810917
way         2.437369
Name: version, dtype: float64

3 Analysis of the tag key/value frequency

What is the temporal evolution of object tags, and more specifically in terms of object version? By designing some functions focusing on OSM element versions, we can have a crucial overview of this aspect.

3.1 Tag key frequency

First we build a small function which investigates on the number of unique elements that are associated with given tag keys.

def tagkey_analysis(genome, pivot_var=['type']):
    return (genome.groupby(['tagkey', *pivot_var])['id']
            .nunique()
            .unstack()
            .fillna(0))
tagkey_overview = tagkey_analysis(enhanced_tag_genome, ['type', 'version'])
tagkey_overview.sort_values(1, ascending=False).iloc[:5,:5]

version                       1        2        3       4       5
tagkey           type                                            
source           way   355974.0  85095.0  13056.0  2861.0  1315.0
building         way   350504.0  81612.0  10592.0  1948.0   671.0
source           node  122482.0  16281.0  10392.0  1541.0   627.0
wall             way   103435.0  19001.0   1754.0   179.0    47.0
addr:housenumber node   86566.0   2882.0   1249.0   742.0   402.0

The previous results show that almost 356k ways of version 1 are tagged with the key source. This information could be even more interesting if we compare it with the total number of first-versionned ways.

def total_elem(genome, pivot_var=['type', 'version']):
    return genome.groupby(pivot_var)['id'].nunique().unstack().fillna(0)
total_elem(enhanced_tag_genome).iloc[:,:5]

version          1         2        3        4       5
type                                                  
node      151184.0   28366.0  15524.0   4292.0  2281.0
relation    5307.0    2546.0   1125.0    654.0   504.0
way       402413.0  109575.0  29578.0  14599.0  9964.0

This last table is a fundamental basis to understand the tag popularity. To recall our previous example, we see that there is more than 402k ways with version equal to 1, that means that the tag key source appears in around 88% of such cases.

Such a result can be generalized for all tuples (tag keys, element type), with subsequent Python procedure:

def tag_frequency(genome, pivot_var=['type', 'version']):
    total_uniqelem = total_elem(genome, pivot_var)
    tagcount = tagkey_analysis(genome, pivot_var)
    # Prepare data: group tag counts by element types
    tagcount_groups = tagcount.groupby(level='type')
    # For each type, compute the proportion of element tagged with each tag
    tag_freq = []
    for key, group in tagcount_groups:
        tag_freq.append( group / total_uniqelem.loc[key])
    # Regroup in one single dataframe and return
    tag_freq = pd.concat(tag_freq)
    return 100*tag_freq.round(4)

tag_frequency(enhanced_tag_genome, ['type','version']).sort_values(1, ascending=False).head(20)[[1,3,5,10,15]]

version                           1      3      5      10     15
tagkey               type                                       
type                 relation  97.32  97.07  97.42  98.57  99.00
source               way       88.46  44.14  13.20   7.27   5.65
building             way       87.10  35.81   6.73   1.31   0.22
source               node      81.02  66.94  27.49   9.52   1.27
name                 relation  70.40  88.00  89.88  91.07  91.04
addr:housenumber     node      57.26   8.05  17.62   0.28   0.00
source               relation  51.86  36.62  19.64  10.71   9.45
ref:FR:FANTOIR       relation  48.82  32.00   9.72   2.50   1.49
wall                 way       25.70   5.93   0.47   0.00   0.00
natural              node      18.53  40.05   0.26   0.00   0.00
start_date           node      17.32  39.99   0.75   0.56   0.00
ref:FR:bordeaux:tree node      17.31  40.02   0.26   0.00   0.00
circumfere           node      17.31  40.02   0.26   0.00   0.00
height               node      17.31  39.96   0.26   0.00   0.00
species              node      16.93  40.02   0.26   0.00   0.00
restriction          relation  11.31   3.64   1.19   0.00   0.00
note:import-bati     way       11.05   0.18   0.01   0.00   0.00
highway              way        7.97  49.69  78.75  82.28  80.22
                     node       7.30  15.45  37.88  43.14  37.97
public_transport     relation   5.18   4.71   2.18   0.71   0.00

As a result, we can see some seminal points in this tag genome, that are fundamental insights of how OSM contributors build the API objects.

For instance, source tags are intensively used in the first version of objects, but the coverage decreases when the objects are updated. The same scheme is applied for ways tagged as building. At the opposite, it is common to add the name tag after a few updates. The highway tag (for ways, no surprise) follows the same increasing trend versions after versions.

3.2 Tag value frequency

Like previously with tag keys, we can measure the popularity of tag values. As a remark, it wouldn’t be so smart to mix up every tag keys and to compare tag
values as various as those associated e.g. with building or parcs. Then we will only study a single reference tag key. For instance, we can focus on road data,
and evaluate how many highway tags are available on the API.

We get similar Python procedures, that take into account tag values with a given tag key.

def tagvalue_analysis(genome, key, pivot_var=['type']):
    return (genome.query("tagkey==@key")
            .groupby(['tagvalue', *pivot_var])['id']
            .nunique()
            .unstack()
            .fillna(0))
tagvalue_overview = tagvalue_analysis(tag_genome, 'highway', ['type', 'version'])
tagvalue_overview.sort_values(1, ascending=False).iloc[:5,:7]

version                 1       2       3       4       5       6       7
tagvalue    type                                                         
residential way   10971.0  9458.0  7201.0  5286.0  3795.0  2725.0  1999.0
service     way    7069.0  2777.0  1409.0   778.0   449.0   292.0   195.0
crossing    node   6338.0  2583.0  1022.0   434.0   205.0   107.0    59.0
footway     way    3797.0  1841.0   782.0   417.0   245.0   146.0    89.0
bus_stop    node   2742.0  2182.0   447.0   179.0    71.0    37.0    11.0

Here we see that the most frequent highway tag value when element are created is residential.

These figures will be compared to the total number of elements tagged as highways that correspond to each element type and version:

def tot_values(genome, key, pivot_var=['type', 'version']):
    return (genome.query("tagkey==@key")
                      .groupby(pivot_var)['id']
                      .nunique()
                      .unstack()
                      .fillna(0))
tot_values(tag_genome, 'highway')[[1,2,3,4,5,10,15]]

version        1        2        3        4       5       10     15
type                                                               
node      11038.0   6055.0   2398.0   1319.0   864.0   154.0   30.0
relation      7.0      3.0      1.0      0.0     0.0     0.0    0.0
way       32080.0  21065.0  14697.0  10632.0  7847.0  2140.0  738.0

That’s not so surprising: a large majority of highway elements are nodes or ways. The proportion of each tag values is computed with the following procedure:

def tagvalue_frequency(genome, key, pivot_var=['type', 'version']):
    total_uniqelem = tot_values(genome, key, pivot_var)
    tagcount = tagvalue_analysis(genome, key, pivot_var=['type','version'])
    tagcount_groups = tagcount.groupby(level='type')
    tag_freq = []
    for key, group in tagcount_groups:
        tag_freq.append( group / total_uniqelem.loc[key])
    tag_freq = pd.concat(tag_freq)
    return (100*tag_freq).round(4)
tagvalue_freq = tagvalue_frequency(tag_genome, 'highway', ['type','version']).swaplevel().sort_values(1, ascending=False)

Contrary to the tag key analysis, we can’t expect a 100% frequency for each tag value, as there can be only one tag value associated with a given key. For a sake of clarity, we can distinguish each element type to present the result:

The less used type: the relation

tagvalue_freq.loc['relation', [1,3,5,10,15]]

version            1      3   5   10  15
tagvalue                                
pedestrian    57.1429  100.0 NaN NaN NaN
raceway       14.2857    0.0 NaN NaN NaN
service       14.2857    0.0 NaN NaN NaN
unclassified  14.2857    0.0 NaN NaN NaN
motorway       0.0000    0.0 NaN NaN NaN

There are only 7 first-versionned relations that are highway-focused, 4 of them are tagged with the value pedestrian. Only one of these relations has a third version. There is no highway-related relation with a higher version number.

the intermediary type: the node

tagvalue_freq.loc['node', [1,3,5,10,15]].head(10)

version                 1        3        5        10       15
tagvalue                                                      
crossing           57.4198  42.6188  23.7269   9.7403   6.6667
bus_stop           24.8415  18.6405   8.2176   0.6494   0.0000
street_lamp         5.3180   0.0000   0.0000   0.0000   0.0000
traffic_signals     5.1912  25.6047  54.6296  68.8312  63.3333
turning_circle      2.9353   6.3803   2.1991   0.0000   3.3333
give_way            2.0112   0.2085   0.1157   0.0000   0.0000
stop                0.8607   0.2919   0.0000   0.0000   0.0000
mini_roundabout     0.5164   2.1268   0.9259   0.0000   0.0000
motorway_junction   0.3533   3.3778   8.9120  20.1299  26.6667
speed_camera        0.1721   0.1251   0.1157   0.0000   0.0000

When OSM contributors tag a new node as highway-related, in most cases the chosen value is crossing. We have also a large amount of bus_stop. The nodes tagged as traffic_signals or motorway_junction tend to reach higher versions.

We don’t say here that both values are the final labels of highway nodes (the previous table do not consider cumulated number of elements, for different version, but pictures of each version taken separately)! However an interpretation is still possible: we can consider that contributor unanimity takes more time for such nodes…

the most natural type: the way

tagvalue_freq.loc['way', [1,3,5,10,15]].head(10)

version            1        3        5        10       15
tagvalue                                                 
residential   34.1989  48.9964  48.3624  36.3084  26.6938
service       22.0355   9.5870   5.7219   2.9907   1.7615
footway       11.8360   5.3208   3.1222   1.3551   0.2710
unclassified   6.0661   7.8179   7.6207   5.9346   4.3360
tertiary       4.9314   7.4913  10.6665  18.0374  25.0678
path           4.1397   1.8099   1.3126   0.2336   0.0000
cycleway       3.8996   3.3068   3.1350   2.8037   2.4390
secondary      3.4819   4.9806   6.7669  11.2150  15.0407
primary        1.8267   2.8033   3.4663   5.6075   9.0786
track          1.3217   0.5511   0.2804   0.0467   0.1355

As for relations and nodes, the repartition of tag values for each way version gives some information on the manner OSM contributors enrich the API. A third
of newly created highway-related ways are tagged as residential. The proportion of such ways remains relatively high versions after versions: they are intensively updated by contributors!

As a last remark, we can compare the tag value distribution with the global highway tag distribution: the Bordeaux area seems to be represented with a larger quantity of footway, secondary and tertiary highways, but with a smaller amount of track tags. Sufficient to say this area is urban, without any prior knowledge of the sub-region…?

4 Conclusion

The rich analysis proposed in this article have shown that dig into the OSM tag set is a demanding but fascinating task. A lot of insights are available to whom is able to let the data do the talking. In such an exercise, we have proposed some tracks, however there is still so much more to do!

In the next article, we will close this parenthesis and come back to our first objective: the OSM data quality. We will consider the metadata extraction, as a first step towards the quality measurement.