OSM user classification: the unavoidable metadata normalization

Monday, July 31st, 2017

At Oslandia, we like working with Open Source tools and handling Open (geospatial) Data. In this article series, we play with OpenStreetMap (OSM) and the data it provides. Here comes the sixth article of the series, dedicated to user metadata normalization.

1 A reminder about the user metadata

In the previous blog post, we built a set of features that describe how OSM users contribute to the mapping effort. They will be our raw material for the clustering process. Since we saved the user metadata in a dedicated .csv file, they can be recovered easily:

import pandas as pd
user_md = pd.read_csv("../src/data/output-extracts/bordeaux-metropole/bordeaux-metropole-user-md-extra.csv", index_col=0)
user_md.query('uid == 24664').T
uid                                   24664
lifespan                        2449.000000
n_inscription_days              3318.000000
n_activity_days                   35.000000
n_chgset                          69.000000
dmean_chgset                       0.000000
nmean_modif_byelem                 1.587121
n_total_modif                   1676.000000
n_total_modif_node              1173.000000
n_total_modif_way                397.000000
n_total_modif_relation           106.000000
n_node_modif                    1173.000000
n_node_modif_cr                  597.000000
n_node_modif_imp                 360.000000
n_node_modif_del                 216.000000
n_node_modif_utd                 294.000000
n_node_modif_cor                 544.000000
n_node_modif_autocor             335.000000
n_way_modif                      397.000000
n_way_modif_cr                    96.000000
n_way_modif_imp                  258.000000
n_way_modif_del                   43.000000
n_way_modif_utd                   65.000000
n_way_modif_cor                  152.000000
n_way_modif_autocor              180.000000
n_relation_modif                 106.000000
n_relation_modif_cr                8.000000
n_relation_modif_imp              98.000000
n_relation_modif_del               0.000000
n_relation_modif_utd               2.000000
n_relation_modif_cor              15.000000
n_relation_modif_autocor          89.000000
n_total_chgset                   153.000000
p_local_chgset                     0.450980
n_total_chgset_id                 55.000000
n_total_chgset_josm                0.000000
n_total_chgset_maps.me_android     0.000000
n_total_chgset_maps.me_ios         0.000000
n_total_chgset_other               1.000000
n_total_chgset_potlatch           57.000000
n_total_chgset_unknown            40.000000

Hmm… It seems that some new features have been added to this table, doesn't it?

Well, yes! We added some more variables to make the analysis finer. We propose to focus on the total number of changesets that each user has opened, and on the ratio between the number of changesets around Bordeaux and this total. We also studied the editors each user has used: JOSM? iD? Potlatch? Maps.me? Another one? Maybe this will provide useful extra information to design user groups.

In total, we have 40 features that describe user behavior, for 2073 users.
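As a quick sanity check (a small sketch, assuming the dataframe loaded above), the shape of the table confirms these figures:

# 2073 users described by 40 features (uid being the index)
user_md.shape  # expected: (2073, 40)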

2 Feature normalization

2.1 It’s normal not to be Gaussian!

We plan to reduce the number of variables, to keep the analysis readable and interpretable, and to run a k-means clustering to group similar users together.

Unfortunately we can't feed such machine learning procedures directly: they work better with unskewed features (e.g. Gaussian-distributed ones) as input. As illustrated by the following histograms, focused on a few of the available features, the variables we have gathered are highly skewed, with most of the mass concentrated near zero (they look exponentially distributed, by the way); this leads us to consider an alternative normalization scheme.

import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

%matplotlib inline
# Histograms of node, way and relation modification counts, focused on the 0-500 range
f, ax = plt.subplots(1, 3, figsize=(12, 4))
ax[0].hist(user_md.n_node_modif, bins=np.linspace(0,500, 25), color='r', normed=True)
ax[0].set_xlabel('Number of node modifications')
ax[0].set_ylabel('Frequency')
ax[0].set_ylim(0,0.05)
ax[1].hist(user_md.n_way_modif, bins=np.linspace(0,500, 25), normed=True)
ax[1].set_xlabel('Number of way modifications')
ax[1].set_ylim(0,0.05)
ax[2].hist(user_md.n_relation_modif, bins=np.linspace(0,500, 25), color='g', normed=True)
ax[2].set_xlabel('Number of relation modifications')
ax[2].set_ylim(0,0.05)
plt.tight_layout()
sns.set_context("paper")

2.2 Feature engineering

The basic idea we want you to keep in mind is the following: if we can find some mathematical tricks to express our variables within simple bounds (e.g. 0 and 100, or -1 and 1), we can design a smarter representation of user characteristics.

First of all you should notice that a lot of variables can be expressed as percentages of other variables:

  • the number of node/way/relation modifications amongst all modifications;
  • the number of created/improved/deleted elements amongst all modifications, for each element type;
  • the number of changesets opened with a given editor, amongst all changesets.

def normalize_features(metadata, total_column):
    # Select the total column and its detailed counterparts, then express
    # each detailed column as a share of the total (0 when the total is 0)
    transformed_columns = metadata.columns[metadata.columns.to_series()
                                           .str.contains(total_column)]
    metadata[transformed_columns[1:]] = metadata[transformed_columns].apply(
        lambda x: (x[1:] / x[0]).fillna(0), axis=1)

normalize_features(user_md, 'n_total_modif')
normalize_features(user_md, 'n_node_modif')
normalize_features(user_md, 'n_way_modif')
normalize_features(user_md, 'n_relation_modif')
normalize_features(user_md, 'n_total_chgset')
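As a quick sanity check (a sketch that assumes creations, improvements and deletions partition the node modifications, as the sample shown above suggests), the resulting shares should sum to roughly 1 for every user who modified at least one node:

# For users with at least one node modification, the creation/improvement/
# deletion shares obtained above should sum to (roughly) 1
node_editors = user_md[user_md['n_node_modif'] > 0]
shares = node_editors[['n_node_modif_cr', 'n_node_modif_imp', 'n_node_modif_del']].sum(axis=1)
np.allclose(shares, 1.0)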

Other features can be normalized starting from their definition: we know that lifespan and n_inscription_days can’t be larger than the OSM lifespan itself (we consider the OSM lifespan as the difference between the first year of modification within the area and the extraction date).

timehorizon = (pd.Timestamp("2017-02-19") - pd.Timestamp("2007-01-01"))/pd.Timedelta('1d')
user_md['lifespan'] = user_md['lifespan'] / timehorizon
user_md['n_inscription_days'] = user_md['n_inscription_days'] / timehorizon
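To inspect the effect of this division (just a quick look, nothing more), we can print the range of the rescaled features:

# Min and max of the rescaled features, now expressed as fractions of the time horizon
user_md[['lifespan', 'n_inscription_days']].describe().loc[['min', 'max']]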

Finally we have to consider the remaining variables that can’t be normalized as percentages of other variables, or as percentages of meaningful quantities. How can we treat them?

We choose to transform these features by comparing each user to all the others: knowing that a user made 100 modifications is interesting, but it is even more telling to compare it with the other users, e.g. by answering the question “how many users made fewer modifications?”. That is precisely the definition of the empirical cumulative distribution function (ECDF).

import statsmodels.api as sm

def ecdf_transform(metadata, feature):
    # Replace the feature by its empirical cumulative distribution function value,
    # i.e. the proportion of users with a smaller or equal value, and rename it
    # with a 'u_' prefix
    ecdf = sm.distributions.ECDF(metadata[feature])
    metadata[feature] = ecdf(metadata[feature])
    new_feature_name = 'u_' + feature.split('_', 1)[1]
    return metadata.rename(columns={feature: new_feature_name})

user_md = ecdf_transform(user_md, 'n_activity_days')
user_md = ecdf_transform(user_md, 'n_chgset')
user_md = ecdf_transform(user_md, 'nmean_modif_byelem')
user_md = ecdf_transform(user_md, 'n_total_modif')
user_md = ecdf_transform(user_md, 'n_node_modif')
user_md = ecdf_transform(user_md, 'n_way_modif')
user_md = ecdf_transform(user_md, 'n_relation_modif')
user_md = ecdf_transform(user_md, 'n_total_chgset')
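To make the idea concrete, here is a tiny toy illustration (with made-up values, not taken from the OSM data) of what the ECDF transform does:

# Each value is mapped to the proportion of observations smaller than or equal to it
toy = pd.Series([1, 5, 10, 100, 1000])
sm.distributions.ECDF(toy)(toy)
# array([0.2, 0.4, 0.6, 0.8, 1. ])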

Consequently, we can characterize a user with this new dashboard:

user_md.query("uid == 24664").T
uid                                24664
lifespan                        0.661534
n_inscription_days              0.896272
u_activity_days                 0.971539
u_chgset                        0.969609
dmean_chgset                    0.000000
u_modif_byelem                  0.838881
u_total_modif                   0.966715
n_total_modif_node              0.699881
n_total_modif_way               0.236874
n_total_modif_relation          0.063246
u_node_modif                    0.967680
n_node_modif_cr                 0.508951
n_node_modif_imp                0.306905
n_node_modif_del                0.184143
n_node_modif_utd                0.250639
n_node_modif_cor                0.463768
n_node_modif_autocor            0.285592
u_way_modif                     0.971539
n_way_modif_cr                  0.241814
n_way_modif_imp                 0.649874
n_way_modif_del                 0.108312
n_way_modif_utd                 0.163728
n_way_modif_cor                 0.382872
n_way_modif_autocor             0.453401
u_relation_modif                0.984563
n_relation_modif_cr             0.075472
n_relation_modif_imp            0.924528
n_relation_modif_del            0.000000
n_relation_modif_utd            0.018868
n_relation_modif_cor            0.141509
n_relation_modif_autocor        0.839623
u_total_chgset                  0.485769
p_local_chgset                  0.450980
n_total_chgset_id               0.359477
n_total_chgset_josm             0.000000
n_total_chgset_maps.me_android  0.000000
n_total_chgset_maps.me_ios      0.000000
n_total_chgset_other            0.006536
n_total_chgset_potlatch         0.372549
n_total_chgset_unknown          0.261438

We then know that the user with ID 24664 made more node, way and relation modifications than 96.7%, 97.2% and 98.5% of the other users respectively, that amongst their node modifications 50.9% were creations, and so on…

In order to complete the normalization procedure, we add a final step that consists in scaling the features, so that they all vary within the same range. As the features are still skewed, we do it according to a simple min-max rule, so as to avoid too much distortion of our data:

from sklearn.preprocessing import RobustScaler

scaler = RobustScaler(quantile_range=(0.0, 100.0))  # scale by the full (min, max) range
X = scaler.fit_transform(user_md.values)
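For reference, this parameterization of RobustScaler amounts to the following manual computation (a sketch that assumes no constant column): each feature is centered on its median and divided by its full (max - min) range, so every feature ends up spanning a unit interval.

# Equivalent manual computation: subtract the column median and divide by the
# full (max - min) range (assuming no constant column, which would divide by zero)
vals = user_md.values
manual = (vals - np.median(vals, axis=0)) / (vals.max(axis=0) - vals.min(axis=0))
np.allclose(X, manual)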

3 Conclusion

This post shows the preparation work that a machine learning procedure involves. It may seem quite short; however, it represents a long and unavoidable process, with several trial-and-error iteration loops, to let the data do the talking (even an exciting job like data scientist has its dark side…)! We will exploit these new data in some machine learning procedures in the next post, dedicated to user classification.