At Oslandia, we like working with Open Source tool projects and handling Open (geospatial) Data. In this article series, we will play with the OpenStreetMap (OSM) map and subsequent data. After a first blog post focused on our working framework, here comes the second article of this series, dedicated to the presentation of OSM data itself.

Introduction

As you should know, OpenStreetMap is a project which creates and distributes free geographical data for the world. Like Wikipedia, it’s a community of people who can create and update some content available for everyone. Thus, anyone can edit buildings, roads, places or even trees and mailboxes!

Working with community-built data forces to take care of data quality. We have to be confident with the data we work with. Is this road geometry accurate enough? Is this street name missing? This is fundamental for companies and NGO who use OSM on a daily basis.

Our main purpose is to answer to this question: can we assess the quality of OpenStreetMap data? (and how?) This post is dedicated to the data gathering, from native OpenStreetMap API.

What do we mean by « data quality »

Let’s remind you that OSM is a community, like Wikipedia, where everyone can create, edit and delete entities. You can suppose that the quality of a contribution depends on the user who made it.

If the user is experienced, you can suppose that the contribution should be good. And when you don’t have any reference data to measure the data accuracy, you can suppose that if the road was created a few years ago with 20 updates, it should be complete and accurate enough.

Geospatial data quality components

Van Oort (2006) defines several spatial data quality criteria:

lineage
positional accuracy
attribute accuracy
logical accuracy
completeness
semantic accuracy
usage, purpose, constraints
temporal quality
variation in quality
meta-quality
resolution

This classification has been recalled in further contributions, however most studies focus on positional accuracy. For instance, Haklay (2010), Koukoletsos et al. (2011) or Helbich et al. (2012) compared OSM data with Ordnance Survey data, an alternative data source considered as a ground truth.

There are two differences with our approach: we don’t have any geospatial data reference to cope with the positional accuracy. Moreover the authors decided to take a snapshot of the OSM data instead of using OSM history data.

OSM contributors and data quality

Other references focus on OSM contributors. Arsanjani et al. (2013) classified OSM contributors based on the quality and quantity of their contributions in Heidelberg (Germany). Five classes are used: « beginner », « regular », « intermediate », « expert », and « professional mappers ». The authors work with reference data in addition to OSM data, they can assess the positional accuracy of a contribution. Moreover, they take into account the completness and the semantic accuracy. Then Neis et al. (2014) proposed a whole set of statistics dedicated to OSM contributors. They provide hand-made groups, and characterize contributions regarding dates, hours, user localisation and activity.

Additional references can be mentionned to overcome the OSM data qualiy issue. The ISO/TC 211 working group published a set of norms for geographical information standardization. For instance, the norm
ISO19157:2013 (2013) cites some of quality attributes mentionned above. See also the Wikipedia notice about the OSM quality assurance which lists several tools to supervise the OSM data construction.

From the OSM history dumps to usable data sets

Extracting OSM data is a simple but complex task.

simple because you just have to download the history dump in pbf (Protocol Buffer) or osh formats from Planet OSM website (osm format refers to latest data, whereas osh refers to history data).
complex because when you want to extract data, it can be a long and tedious task

The pbf (Protocol Buffer) file format is quite big: ~57Go. Note that the xml file is compressed with bzip2. It can be long (+36 hours) and take some place (1TB) if you uncompress it (see more on OSM wiki).

The challenge here is to pass from these native format to in-base data or .csv files. Several tools exist to accomplish this effort: osm2pgsql, osmosis, osmium-tool or osmium. We propose here to use the latter, and its dedicated Python library. This Python extension can be installed through apt-get:

sudo apt-get install python-pyosmium

…or via pip:

pip install pyosmium

What sort of data are behind the OpenStreetMap API?

Pyosmium documentation is a rich source of information in order to understand the pyosmium library functioning. Several features can be identified within the OSM data.

Within the OSM API, a set of OSM seminal entities can be easily identified:

nodes, characterized by geographical coordinates;
ways, characterized by a list of nodes;
relations, characterized by a set of « members », i.e. nodes, ways
or other relations.

In addition to these three element types, a fundamental object is the change set. It describes a set of modifications done by a single user, during a limited amount of time.

Each of these OSM objects are characterized by a set of common attributes, that are IDs, timestamps, visible (is the object still visible on the API?), user ID, or a list of tags (a tag being the association between a key and a value).

Starting from these OSM elements, we can straightforwardly answer typical questions like:

How many nodes does each user create?
How frequent are the mofifications for each contributor?
How many tags does each OSM element contain?
…

Considering the history of OSM data makes the data set even more complete: it allows us to study the temporal evolution of the API.

Conclusion

The OSM data features are full of information. After extracting them, we plan to use them in order to characterize the OSM data quality, as described above. It will be the aim of further articles: the next one will address the OSM data parsing.

References

Arsanjani, J, Barron, C, Bakillah, M, Helbich, M. 2013. Assessing the quality of OpenStreetMap contributors together with their contributions. Proceedings of the AGILE. p14-17.
Haklay, M. 2010. How good is volunteered geographical information? A comparative study of OpenStreetMap and Ordnance Survey datasets. Environment and planning B: Planning and design. 37(4),
p.682-703.
Helbich, M, Amelunxen, C, Neis, P, Zipf, A. 2012. Comparative spatial analysis of positional accuracy of OpenStreetMap and proprietary geodata. Proceedings of GI Forum. p.24-33.
ISO. 2013. Geographic information: data quality. ISO19157:2013. Geneva, Switzerland: ISO.
Koukoletsos, T, Haklay, M, Ellul, C. 2011. An automated method to assess data completeness and positional accuracy of OpenStreetMap. GeoComputation. 3, p.236-241.
Neis, P, Zipf, A. 2012. Analyzing the contributor activity of a volunteered geographic information project: the case of OpenStreetMap. ISPRS International Journal of Geo-Information, Molecular Diversity Preservation. 1, p.146-165.
Van Oort, P. 2006. Spatial data quality: from description to application. PhD report. Wageningen Universiteit

OpenStreetMap Data Presentation: Which Data, How to Get Them?