OSM metadata description: the data behind the data

Monday, July 24th, 2017 · Data

At Oslandia, we like working with Open Source tools and handling Open (geospatial) Data. In this article series, we play with the OpenStreetMap (OSM) map and its data. Here comes the fifth article of this series, dedicated to the extraction of OSM metadata from the OSM history data.

1 Extract OSM data history

1.1 An air of “déjà vu”

In previous articles, dedicated to OSM chronological evolution and OSM tag set analysis, we saw examples of OSM data parsing. Here we will use a similar parser to get the OSM history, i.e. every version of every object for a given area.

import osmium as osm
import pandas as pd

class TimelineHandler(osm.SimpleHandler):
    def __init__(self):
        osm.SimpleHandler.__init__(self)
        self.elements = []

    def add_elements(self, e, elem_type):
        self.elements.append([elem_type,
                              e.id,
                              e.version,
                              e.visible,
                              pd.Timestamp(e.timestamp),
                              e.uid,
                              e.changeset])

    def node(self, n):
        self.add_elements(n, 'node')

    def way(self, w):
        self.add_elements(w, 'way')

    def relation(self, r):
        self.add_elements(r, 'relation')

tlhandler = TimelineHandler()
tlhandler.apply_file("../src/data/raw/bordeaux-metropole.osh.pbf")
colnames = ['type', 'id', 'version', 'visible', 'ts', 'uid', 'chgset']
elements = pd.DataFrame(tlhandler.elements, columns=colnames)
elements = elements.sort_values(by=['type', 'id', 'ts'])

If you have read the previous articles, this parsing class should be familiar to you. Let’s briefly recall what it does: we read a native OSM history file, bordeaux-metropole.osh.pbf, with the help of the pyosmium library, and store every element along with basic information such as its version, timestamp, user id and so on, in a simple Python structure. In the end, a pandas DataFrame makes things even clearer. This structure can be saved on disk, or handled directly within the Python workspace.
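
For instance, here is a minimal sketch of how the DataFrame could be persisted for later reuse (the file name is purely illustrative):

# A minimal sketch to save the parsed history on disk; the path is
# illustrative, and a binary format (e.g. HDF5) would be more compact
elements.to_csv("bordeaux-metropole-elements.csv", index=False)
# ...and to reload it later, parsing timestamps back into datetimes
elements_from_csv = pd.read_csv("bordeaux-metropole-elements.csv",
                                parse_dates=['ts'])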

1.2 Towards the metadata

Here we have the OSM data history corresponding to the Bordeaux area. It is largely descriptive; however, it says nothing about data quality in itself.

We could compare the OSM data with some alternative data source; however, there is no real consensus about any alternative data set being a perfect reference. Knowing that, we choose to analyse the metadata associated with OSM elements instead of the data themselves.

Some data aggregation operations are needed along the way, to better describe how OSM objects evolve.

2 What do we mean by metadata?

Here we plan to go deeper than just considering the geometric data: some additional information is available if we check the history of contributions, and if we focus on different changesets, users, or even on meta-information about OSM elements themselves.

2.1 Element metadata

That sounds quite obvious at first sight: why don’t we just look at the OSM elements? Two major drawbacks prevent us from working directly with such objects:

  • in reality, not much is gained in terms of quality assessment: the variables associated with elements are the number of versions, of tags, of contributors and so on (as sketched after this list). They provide very interesting pictures of OSM data, but remain largely descriptive;
  • it’s fairly intensive in terms of computing resources! We get around 3 million OSM elements for the Bordeaux area alone; just imagine a whole country, a continent or even the entire world…
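
To make the first point concrete, here is a minimal sketch of what such element-level metadata could look like, starting from the elements DataFrame built above (the column names are just an illustration):

# A sketch of element-level metadata: number of available versions and
# number of distinct contributors per OSM element (purely descriptive)
elem_md = (elements.groupby(['type', 'id'])['version']
           .count()
           .reset_index())
elem_md.columns = ['type', 'id', 'n_versions']
elem_md['n_contributors'] = (elements.groupby(['type', 'id'])['uid']
                             .nunique()
                             .reset_index())['uid']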

2.2 Changeset metadata

Investigating changesets could provide an additional source of information. We know that elements are modified in the context of an opened changeset.

We can count the number of modifications contained in a given changeset, assess whether new elements have been created during it, or count the number of elements that have been modified or even deleted.
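
As a minimal sketch (using only the columns parsed in the first section, with illustrative variable names), such per-changeset counts could be built as follows:

# A sketch of per-changeset counts: how many modifications each changeset
# contains, and how many of them are deletions (i.e. versions where the
# element is no longer visible)
chgset_counts = (elements.groupby('chgset')['id']
                 .count()
                 .reset_index())
chgset_counts.columns = ['chgset', 'n_modif']
chgset_counts['n_deletion'] = (elements
                               .assign(deleted=~elements['visible'].astype(bool))
                               .groupby('chgset')['deleted']
                               .sum()
                               .reset_index())['deleted']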

Classifying changesets is possible: we could hypothesize that there are, on the one hand, “productive” changesets, where a large amount of nodes, ways and/or relations are modified and where modifications are durable; and, on the other hand, less productive ones (we don’t say “useless”!), with fewer creations, fewer modifications and less durable elements.

However the information may be gathered more efficiently by considering those who produce changesets: the contributors themselves.

2.3 User metadata

We can hypothesize that a user who contributes a lot, on every kind of OSM element, and whose contributions stay valid for a long time (or even: are still valid!) is an experienced user, and that the elements they have contributed to are well represented.

The link between users and elements is more natural than the one between changesets and elements: it is possible to characterize OSM data quality by considering which type of user contributes the most to each element. More simply, we can consider the most experienced user who has contributed to an element as a flag of the element quality. The quality of an element may also be indicated by the type (more or less experienced) of its last contributor.
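
As an illustration, here is a minimal sketch of how the last contributor of each element could be extracted from the history built in the first section (recall that elements is sorted by type, id and timestamp):

# A sketch: keep, for each OSM element, the uid of its last contributor
# ('elements' is sorted by type, id and timestamp, so the last row of each
# group corresponds to the most recent version)
last_contributor = (elements.groupby(['type', 'id'])['uid']
                    .last()
                    .reset_index())
last_contributor.columns = ['type', 'id', 'last_uid']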

This last hypothesis will be our central thread in the next section (and… *spoiler warning* in the next articles!).

3 Extract user metadata

We decide to go deeper into the analysis of user contributions. As a reminder, we’ve extracted OSM data history in the first section of this blog article. We can pick one element randomly as an example:

elements.sample().T
                             61781
type                          node
id                       260953178
version                          3
visible                       True
ts       2010-07-31 18:23:54+00:00
uid                          53048
chgset                     5364764

Here comes the time to consider each user.

3.1 Time-related features

We begin with the time they’ve spent on OSM.

user_md = (elements.groupby('uid')['ts']
            .agg(["min", "max"])
            .reset_index())
user_md.columns = ['uid', 'first_at', 'last_at']
user_md['lifespan'] = ((user_md.last_at - user_md.first_at)
                        / pd.Timedelta('1d'))
extraction_date = elements.ts.max()
user_md['n_inscription_days'] = ((extraction_date - user_md.first_at)
                                  / pd.Timedelta('1d'))
elements['ts_round'] = elements.ts.apply(lambda x: x.round('d'))
user_md['n_activity_days'] = (elements
                              .groupby('uid')['ts_round']
                              .nunique()
                              .reset_index())['ts_round']
user_md = user_md.sort_values(by=['first_at'])
user_md.query('uid == 4074141').T
                                         1960
uid                                   4074141
first_at            2016-06-06 14:25:01+00:00
last_at             2016-06-09 12:39:47+00:00
lifespan                              2.92692
n_inscription_days                    258.186
n_activity_days                             2

With these short code lines, we have gathered some temporal features showing how each user contributes through time. In the provided example, the user with ID 4074141 has been registered as an OSM contributor for 258 days; their lifespan on the OSM website is almost 3 days; and they made modifications on two different days.

3.2 Changeset-related features

Then we can focus on changeset-related information. By definition, each user has opened at least one changeset (yes, even you, if you’ve contributed!). Let’s construct a small changeset metadata DataFrame:

chgset_md = (elements.groupby('chgset')['ts']
              .agg(["min", "max"])
              .reset_index())
chgset_md.columns = ['chgset', 'first_at', 'last_at']
chgset_md['duration'] = ((chgset_md.last_at - chgset_md.first_at)
                          / pd.Timedelta('1m'))
chgset_md = pd.merge(chgset_md,
                     elements[['chgset','uid']].drop_duplicates(),
                     on=['chgset'])
chgset_md.sample().T
                              33265
chgset                     44138773
first_at  2016-12-03 15:02:58+00:00
last_at   2016-12-03 15:02:58+00:00
duration                          0
uid                         4867435

Each changeset is associated with its starting and ending times, its duration (in minutes) and the user responsible for it. We can then compute, for each user, the number of changesets and their mean duration.

user_md['n_chgset'] = (chgset_md.groupby('uid')['chgset']
                       .count()
                       .reset_index())['chgset']
user_md['dmean_chgset'] = (chgset_md.groupby('uid')['duration']
                           .mean()
                           .reset_index())['duration']
user_md.query('uid == 4074141').T
                                         1960
uid                                   4074141
first_at            2016-06-06 14:25:01+00:00
last_at             2016-06-09 12:39:47+00:00
lifespan                              2.92692
n_inscription_days                    258.186
n_activity_days                             2
n_chgset                                    3
dmean_chgset                          22.2778

Wow, there is some new interesting information here: we know that user 4074141 produced three changesets during their lifespan, and that the mean duration of these changesets is around 22 minutes.

3.3 Contribution intensity

We then noticed, during some preliminary observations, that some users are so productive that they modify the same elements several times: a typical bot-like behavior if this amount is large, or simple auto-corrections? We can add this information as follows:

contrib_byelem = (elements.groupby(['type', 'id', 'uid'])['version']
                  .count()
                  .reset_index())
user_md['nmean_modif_byelem'] = (contrib_byelem.groupby('uid')['version']
                                 .mean()
                                 .reset_index())['version']
user_md.query('uid == 4074141').T
                                         1960
uid                                   4074141
first_at            2016-06-06 14:25:01+00:00
last_at             2016-06-09 12:39:47+00:00
lifespan                              2.92692
n_inscription_days                    258.186
n_activity_days                             2
n_chgset                                    3
dmean_chgset                          22.2778
nmean_modif_byelem                    2.94061

Oh-oh… Our nice user 4074141 seems to modify each OSM element almost three times on average. That’s too little to conclude that they are a bot; however, they seem quite unsure about the contributions they make…

3.4 Element-related features

In order to characterize how a user contributes, many additional features are still missing. The most important ones are related to the number of modifications.

newfeature = (elements.groupby(['uid'])['id']
              .count()
              .reset_index()
              .fillna(0))
newfeature.columns = ['uid', "n_total_modif"]
user_md = pd.merge(user_md, newfeature, on='uid', how="outer").fillna(0)
newfeature = (elements.query('type == "node"').groupby(['uid'])['id']
              .count()
              .reset_index()
              .fillna(0))
newfeature.columns = ['uid', "n_total_modif_node"]
user_md = pd.merge(user_md, newfeature, on='uid', how="outer").fillna(0)
newfeature = (elements.query('type == "way"').groupby(['uid'])['id']
              .count()
              .reset_index()
              .fillna(0))
newfeature.columns = ['uid', "n_total_modif_way"]
user_md = pd.merge(user_md, newfeature, on='uid', how="outer").fillna(0)
newfeature = (elements.query('type == "relation"').groupby(['uid'])['id']
              .count()
              .reset_index()
              .fillna(0))
newfeature.columns = ['uid', "n_total_modif_relation"]
user_md = pd.merge(user_md, newfeature, on='uid', how="outer").fillna(0)

user_md.query('uid==4074141').T
                                             1960
uid                                       4074141
first_at                2016-06-06 14:25:01+00:00
last_at                 2016-06-09 12:39:47+00:00
lifespan                                  2.92692
n_inscription_days                        258.186
n_activity_days                                 2
n_chgset                                        3
dmean_chgset                              22.2778
nmean_modif_byelem                        2.94061
n_total_modif                                1832
n_total_modif_node                           1783
n_total_modif_way                              46
n_total_modif_relation                          3

OK! This user is very active in mapping the Bordeaux area! They proposed 1832 modifications, of which 1783, 46 and 3 were dedicated to nodes, ways and relations respectively. However, the number of distinct modified elements is smaller, as this user made several contributions per element on average.

Here you can see the difference between the number of changesets and the number of element modifications: as a reminder, a changeset can include several modifications (see the description of OSM changesets on the official OSM wiki: https://wiki.openstreetmap.org/wiki/Changeset).
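
As a quick illustrative check (a small sketch, not part of the pipeline), we can verify both points for our example user:

# Two quick checks on user 4074141: how many distinct elements they have
# touched (fewer than their 1832 modifications), and how many modifications
# each of their three changesets contains
user_contribs = elements.query('uid == 4074141')
n_distinct_elements = user_contribs.drop_duplicates(['type', 'id']).shape[0]
modifs_per_chgset = user_contribs.groupby('chgset')['id'].count()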

The number of modifications can be described even more finely! Why don’t we consider whether modifications are still valid, or whether other modifications arose after the user’s action? What about elements that have been deleted since then (we consider that working on a now-useless element is not so valuable for the community)?

3.5 Modification-related features

It is actually possible; however, we need to associate a few more features with OSM elements. Is the current version the initialization of the object? Is it the up-to-date version? Will it be corrected later (by another user, or by the current user themselves)?

import numpy as np

osmelem_versioning = (elements.groupby(['type', 'id'])['version']
            .agg(["first", "last"])
            .reset_index())
osmelem_versioning.columns = ['type', 'id', 'vmin', 'vmax']

elements = pd.merge(elements, osmelem_versioning, on=['type', 'id'])
elements['init'] = elements.version == elements.vmin
elements['up_to_date'] = elements.version == elements.vmax
# note that 'elements' is sorted by type, id and ts
elements['willbe_corr'] = np.logical_and(elements.id.diff(-1)==0,
                                          elements.uid.diff(-1)!=0)
elements['willbe_autocorr'] = np.logical_and(elements.id.diff(-1)==0,
                                                 elements.uid
                                                 .diff(-1)==0)

elements.query("id == 1751399951").T
                                   1620248
type                                  node
id                              1751399951
version                                  1
visible                               True
ts               2012-05-13 17:53:14+00:00
uid                                 260584
chgset                            11588470
ts_round         2012-05-14 00:00:00+00:00
vmin                                     1
vmax                                     1
init                                  True
up_to_date                            True
willbe_corr                          False
willbe_autocorr                      False

Here is a short example of the newly enriched element definition: we now know that the node with ID 1751399951 has only one version, so this version of course corresponds to its initialization and is up to date. As there is no second version (until the extraction date!), the element has not been (auto-)corrected yet. These features help describe user contributions more precisely:

def create_count_features(metadata, element_type, data, grp_feat, res_feat, feature_suffix):
    feature_name = 'n_'+ element_type + '_modif' + feature_suffix
    newfeature = (data.groupby([grp_feat])[res_feat]
                  .count()
                  .reset_index()
                  .fillna(0))
    newfeature.columns = [grp_feat, feature_name]
    metadata = pd.merge(metadata, newfeature, on=grp_feat, how="outer").fillna(0)
    return metadata

def extract_modif_features(metadata, data, element_type):
    typed_data = data.query('type==@element_type')
    metadata = create_count_features(metadata, element_type, typed_data,
                               'uid', 'id', '')
    metadata = create_count_features(metadata, element_type,
                               typed_data.query("init"),
                               'uid', 'id', "_cr")
    metadata = create_count_features(metadata, element_type,
                               typed_data.query("not init and visible"),
                               'uid', 'id', "_imp")
    metadata = create_count_features(metadata, element_type,
                               typed_data.query("not init and not visible"),
                               'uid', 'id', "_del")
    metadata = create_count_features(metadata, element_type,
                               typed_data.query("up_to_date"),
                               'uid', 'id', "_utd")
    metadata = create_count_features(metadata, element_type,
                               typed_data.query("willbe_corr"),
                               'uid', 'id', "_cor")
    metadata = create_count_features(metadata, element_type,
                               typed_data.query("willbe_autocorr"),
                               'uid', 'id', "_autocor")
    return metadata

user_md = extract_modif_features(user_md, elements, 'node')
user_md = extract_modif_features(user_md, elements, 'way')
user_md = extract_modif_features(user_md, elements, 'relation')
user_md = user_md.set_index('uid')
user_md.query("uid == 4074141").T
uid                                         4074141
first_at                  2016-06-06 14:25:01+00:00
last_at                   2016-06-09 12:39:47+00:00
lifespan                                    2.92692
n_inscription_days                          258.186
n_activity_days                                   2
n_chgset                                          3
dmean_chgset                                22.2778
nmean_modif_byelem                          2.94061
n_total_modif                                  1832
n_total_modif_node                             1783
n_total_modif_way                                46
n_total_modif_relation                            3
n_node_modif                                   1783
n_node_modif_cr                                   0
n_node_modif_imp                               1783
n_node_modif_del                                  0
n_node_modif_utd                                  0
n_node_modif_cor                                598
n_node_modif_autocor                           1185
n_way_modif                                      46
n_way_modif_cr                                    0
n_way_modif_imp                                  46
n_way_modif_del                                   0
n_way_modif_utd                                   0
n_way_modif_cor                                  23
n_way_modif_autocor                              23
n_relation_modif                                  3
n_relation_modif_cr                               0
n_relation_modif_imp                              3
n_relation_modif_del                              0
n_relation_modif_utd                              0
n_relation_modif_cor                              2
n_relation_modif_autocor                          1

That’s a rather complete picture of user 4074141’s contributions, isn’t it? Amongst the 1783 node modifications, there are… 1783 improvements (so no creation, no deletion). 598 of these modifications have since been corrected by other users, and 1185 of them are auto-corrections; but none of them resulted in up-to-date nodes! We can draw a comparable picture for ways and relations. As a result, we have identified a user who contributes a lot to improve OSM elements; however, their contributions never end up as the final representation of those elements.

By considering every single user who has contributed to a given area, we can easily imagine that some user groups could emerge.

4 Conclusion

In this new blog post, we have presented some generic information about the OSM contribution history. We have seen that user metadata can easily be built through a few aggregation operations on the OSM data history. We have proposed a bunch of features to characterize, as well as possible, the way people contribute to OSM. Of course many other variables could be designed; we encourage you to think about them if you are interested in the topic!

In the next blog post, we will see how to use this new information to group OSM users, with the help of some well-known machine learning procedures.