OSM data classification: code release

Wednesday, 27 December 2017 · Data

After a series of blog posts published last summer, we are glad to announce that the dedicated code (version 1.0) has been released on GitHub.

Our OpenStreetMap history data pipeline will let you analyze user contributions through time, following several modalities:

  • evaluate the evolution of an area in terms of nodes, ways and relations;
  • investigate the OSM tag genome, i.e. the whole set of tag keys and values used in OSM;
  • classify OSM users based on their contributions, with the help of unsupervised learning techniques (Principal Component Analysis and KMeans).
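
To give a flavour of that last step, here is a minimal, self-contained sketch of PCA followed by KMeans on toy contribution features. It uses plain NumPy rather than the project's actual implementation, and the feature counts are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for per-user contribution features (e.g. counts of node,
# way and relation edits); the real pipeline extracts far richer features.
centers = np.repeat(np.eye(3) * 8.0, 20, axis=0)   # three separated user groups
X = centers + rng.normal(size=(60, 3))

# PCA via SVD on the centered matrix: keep the two leading components.
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
X2 = Xc @ Vt[:2].T

# Plain KMeans (k=3): alternate nearest-centroid assignment and centroid update.
k = 3
centroids = X2[rng.choice(len(X2), size=k, replace=False)]
for _ in range(50):
    dists = ((X2[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    labels = dists.argmin(axis=1)
    centroids = np.array([
        X2[labels == j].mean(axis=0) if (labels == j).any() else centroids[j]
        for j in range(k)
    ])

print(np.bincount(labels, minlength=k))  # number of users per cluster
```

The real pipeline works on many more dimensions, which is precisely why PCA is applied before clustering.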

How to use the code?

First of all, you can clone the GitHub repository:
git clone https://github.com/Oslandia/osm-data-classification.git .

We developed the code with Python 3 and based our data pipeline on the Luigi library. Hence each command must explicitly refer to a Luigi task. The command structure is as follows:
python3 -m luigi --local-scheduler --module <module> <task> <--args>
Or alternatively:
luigi --local-scheduler --module <module> <task> <--args>
if Luigi is in your PATH.

These commands must be run from the src directory. Alternatively, you can run them from anywhere if you add the src directory to your PYTHONPATH variable.

The possible command arguments are of two sorts:

  • Luigi-focused command-line arguments (--local-scheduler, --module, --help);
  • Task-focused command-line arguments (in that case, the list of arguments can be printed by running Luigi on the specific task with the --help argument).

Data gathering

At the beginning of the data pipeline, we need an OSM history file (with the .osh.pbf extension). Such a file can be downloaded, for instance, from the Geofabrik website.

Supposing we are at the project root, we can get a small example dataset:
wget -P ./data/raw http://download.geofabrik.de/europe/isle-of-man.osh.pbf
We store it in the ./data/raw directory. Be careful with paths: the default argument for the data repository is --datarep="data". Depending on where you run the code from, you may have to change it.
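
To see why the working directory matters, here is a small illustrative sketch (the helper name is ours, not part of the project) of how a relative --datarep value resolves against the launch directory:

```python
from pathlib import Path


def resolve_datarep(datarep: str, cwd: str) -> Path:
    """Resolve a (possibly relative) data repository path against the
    directory the pipeline is launched from. A relative --datarep="data"
    therefore points somewhere different when run from src/ than from
    the project root."""
    p = Path(datarep)
    return p if p.is_absolute() else Path(cwd) / p


# Launched from the project root, "data" resolves under the root...
print(resolve_datarep("data", "/home/user/osm-data-classification"))
# ...but launched from src/, the same value points one level deeper.
print(resolve_datarep("data", "/home/user/osm-data-classification/src"))
```

Passing an absolute path to --datarep sidesteps the issue entirely.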

Result production

Some Luigi tasks (especially the KMeans-related ones) produce extra CSV files during the analysis. They are stored in datarep/output-extracts/.
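
These extracts are plain CSV, so they are straightforward to post-process. Here is a hypothetical sketch with the standard csv module; the column names and values below are illustrative, not the project's actual output schema:

```python
import csv
import io

# Hypothetical sample mimicking a KMeans output extract: one row per
# OSM user with the cluster the pipeline assigned them to.
sample = """uid,n_nodes,n_ways,n_relations,kmeans_cluster
1001,5423,312,4,0
1002,87,9,0,2
1003,120934,8841,56,1
"""

# In practice you would open the file under datarep/output-extracts/.
with io.StringIO(sample) as f:
    rows = list(csv.DictReader(f))

# Group user ids by cluster label to inspect each contributor profile.
by_cluster = {}
for row in rows:
    by_cluster.setdefault(row["kmeans_cluster"], []).append(row["uid"])
print(by_cluster)
```

From there, each group can be characterised by summarising its contribution counts.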


Some additional comments are available in the project README on GitHub. If you have questions when using the project, please contact us by email (infos+data@oslandia.com) or via the GitHub issue system. If you want to add new features to the code, do not hesitate to contribute!