After a series of blog posts published last summer, we are glad to announce that the dedicated code (version 1.0) has been released on GitHub.
Our OpenStreetMap history data pipeline lets you analyze user contributions through time in several ways:
- evaluate how an area evolves in terms of nodes, ways and relations;
- investigate the OSM tag genome, i.e. the whole set of tag keys and values that are used in OSM;
- classify OSM users according to their contributions, with the help of unsupervised learning techniques (Principal Component Analysis and KMeans).
How to use the code?
First of all, you can clone the GitHub repository:
git clone https://github.com/Oslandia/osm-data-classification.git .
We developed the code with Python 3 and built our data pipeline on the Luigi library. Hence the command must explicitly refer to a Luigi operation. The command structure must be as follows:
python3 -m luigi --local-scheduler --module <module> <task> <--args>
Or alternatively:
luigi --local-scheduler --module <module> <task> <--args>
if Luigi is in your PATH.
These commands must be run from the src directory. You can also run them from elsewhere if you add the src directory to your PYTHONPATH variable.
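The PYTHONPATH option can be illustrated with a short Python sketch that builds the Luigi command and an environment in which the src directory is importable. It is only a sketch: build_luigi_command and env_with_src are hypothetical helper names, and "my_module" and "MyTask" are placeholders, not names taken from the repository.

```python
import os
import sys


def build_luigi_command(module, task, task_args=()):
    """Return the argument list for a local Luigi run (hypothetical helper)."""
    return [sys.executable, "-m", "luigi", "--local-scheduler",
            "--module", module, task, *task_args]


def env_with_src(src_dir, base_env=None):
    """Return a copy of the environment with src_dir prepended to PYTHONPATH."""
    env = dict(os.environ if base_env is None else base_env)
    src_dir = os.path.abspath(src_dir)
    existing = env.get("PYTHONPATH")
    # Keep any PYTHONPATH entries that are already set.
    env["PYTHONPATH"] = src_dir if not existing else os.pathsep.join([src_dir, existing])
    return env


if __name__ == "__main__":
    # "my_module" and "MyTask" are placeholders, not names from the repository.
    cmd = build_luigi_command("my_module", "MyTask", ["--help"])
    print(" ".join(cmd))
```

Passing the result of env_with_src("src") as the env argument of subprocess.run(cmd, ...) should then launch the task from any working directory, assuming Luigi is installed for the current interpreter.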
The possible command arguments are of two sorts:
- Luigi-focused command-line arguments (--local-scheduler, --module, --help);
- Task-focused command-line arguments (in this case, the list of arguments can be printed by running Luigi on the specific task with the --help argument).
Data gathering
At the beginning of the data pipeline we need an OSM history file (with the .osh.pbf extension). This kind of file can be downloaded, for instance, from the Geofabrik website.
Let's suppose that we are at the project root. We can get a small dataset as an example:
wget -P ./data/raw http://download.geofabrik.de/europe/isle-of-man.osh.pbf
We store it in the ./data/raw directory. Be careful with paths: the default value of the data repository argument is --datarep="data". Depending on where you run the code from, you may have to change it.
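To make the path issue concrete, here is a small Python sketch that checks which candidate data directory actually contains the history file before launching a task. It assumes the layout described above, i.e. the file sits in the raw subdirectory of the data repository. The helper names raw_history_path and find_datarep are hypothetical, and "../data" is only a guess for a run started from the src directory.

```python
from pathlib import Path


def raw_history_path(datarep, dataset):
    """Path where the history file is expected: <datarep>/raw/<dataset>.osh.pbf."""
    return Path(datarep) / "raw" / f"{dataset}.osh.pbf"


def find_datarep(candidates, dataset):
    """Return the first candidate directory holding the history file, else None."""
    for candidate in candidates:
        if raw_history_path(candidate, dataset).is_file():
            return Path(candidate)
    return None


if __name__ == "__main__":
    # "data" is the documented default; "../data" is an assumption for runs from src.
    found = find_datarep(["data", "../data"], "isle-of-man")
    if found is None:
        print("History file not found; pass an explicit --datarep value.")
    else:
        print(f"Use --datarep={found}")
```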
Result production
Some Luigi tasks (especially the KMeans-related ones) produce extra CSV files during the analysis. They are stored in datarep/output-extracts/.
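Once a task has finished, a quick way to see what was produced is to list the CSV files under the output directory. The following Python sketch does this with the standard library only. The helper names list_extracts and read_header are hypothetical, and the sketch assumes comma-separated files whose first row is a header, which may not hold for every extract.

```python
import csv
from pathlib import Path


def list_extracts(datarep):
    """Return the CSV files found under <datarep>/output-extracts, sorted by path."""
    root = Path(datarep) / "output-extracts"
    if not root.is_dir():
        return []
    # Search recursively, in case tasks write into subdirectories.
    return sorted(root.rglob("*.csv"))


def read_header(csv_path):
    """Return the first row of a CSV file, or an empty list if the file is empty."""
    # Assumption: comma-separated values with a header on the first line.
    with open(csv_path, newline="") as f:
        return next(csv.reader(f), [])


if __name__ == "__main__":
    # "data" is the default data repository; adapt it to your --datarep value.
    for path in list_extracts("data"):
        print(path, read_header(path))
```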
Some additional comments are in the README of the project on GitHub. If you have questions when using the project, please contact us by email (infos+data@oslandia.com) or through the GitHub issue tracker. If you want to add new features to the code, do not hesitate to contribute!