Open Sourcing Touristic POI Database – Questions Around Format, Interest

We’re planning to open source our touristic POI database (currently 1.4 million points worldwide). There is some effort involved in generalizing it from our internal format, so I wanted to confirm that there is interest in it and get some feedback on the format. I’ve also outlined the process of creating and updating the dataset, as it gives some insight into what to expect from it and may be of interest, particularly to the people in this sub.

POI data points

- Location (mandatory)
- Category (mandatory, more on that later)
- Name
- Images (designated thumbnail with blur hash, all with permissive licensing information)
- Localizations (consisting of a name, teaser and description in one of the supported languages; availability varies)
- Rating (mandatory, more on that later)
- Source (mandatory, such as Wikidata, OSM, tourism council etc.)
- Type (most POIs are individual sights, but “special” POIs such as places, i.e. cities/towns, exist)
- Parent (if it exists, a “special” POI such as a city or town)
- Links/References (links to the Wikidata entity and Wikipedia/Wikivoyage articles in different languages, but also links to social media (fb, ig, twitter etc.), booking sites (agoda, booking, hotels.com etc.) or relevant 3rd party sites such as Trip Advisor, Atlas Obscura etc.)
- Misc. properties: web address, telephone, zip code, opening hours, heritage designation (UNESCO, UK Grade I building etc.). More depending on the source.
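To make the field list above more concrete, here is a rough sketch of how one record could be modelled. All class and attribute names (`Poi`, `Localization`, `poi_type`, `parent_id` and so on) are my own placeholders, not the names used in our internal format, and images are left out for brevity.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Localization:
    language: str            # language code such as "en" or "de"
    name: str
    teaser: str = ""
    description: str = ""


@dataclass
class Poi:
    # Mandatory fields
    lat: float
    lon: float
    category: str
    rating: float
    source: str              # e.g. "wikidata", "osm", a tourism council
    # Optional fields
    name: Optional[str] = None
    poi_type: str = "sight"  # hypothetical values: "sight" or "place"
    parent_id: Optional[str] = None
    localizations: list[Localization] = field(default_factory=list)
    links: dict[str, str] = field(default_factory=dict)       # e.g. {"wikidata": "..."}
    properties: dict[str, str] = field(default_factory=dict)  # phone, zip code, ...

    def validate(self) -> None:
        """Raise ValueError if a mandatory field is missing or out of range."""
        if not (-90 <= self.lat <= 90 and -180 <= self.lon <= 180):
            raise ValueError("coordinates out of range")
        if not self.category or not self.source:
            raise ValueError("category and source are mandatory")
```

Keeping the miscellaneous properties in a free-form dictionary reflects the "more depending on the source" part: the set of keys would differ per source.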

We derive our content from many different sources. Some of them we simply map to the above format (especially those derived from regional or country-level tourism councils). The bulk, however, is combined from Wikidata, Wikipedia, Wikivoyage and OpenStreetMap in the following manner.

Process

1. Process the complete Wikidata dump, keeping only entities that have a geocoordinate and an “instance of” claim. The “instance of” claim is then checked against a list of touristically relevant classes. Note: this claim can be very specific, such as olive sand beach or agricultural theme park, so we expand our list of touristically relevant classes (e.g. beach and amusement park) to include their descendant subclasses. We get a lot of structured information from this source (especially links to other sites) but little in the way of descriptions, images etc.
2. Process all linked articles in the different language versions of Wikipedia/Wikivoyage (at the moment we look at the English, German, French, Spanish, Italian, Portuguese and Polish sites).
3. Extract teasers and shorter excerpts for descriptions (Localizations) as well as images with their respective licenses.
4. Clean up low-quality and unspecific images.
5. Assign parents (“special” POIs such as cities and towns) based on the “located in administrative region” claim. The POIs assigned this way then form an area, which is used to assign further POIs within that area to the same parent.
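The first step can be sketched roughly as follows. This is a simplified illustration, not our actual pipeline: it assumes entities in the Wikidata JSON dump layout (`claims` → property → `mainsnak` → `datavalue` → `value`), uses the real property IDs P31 (instance of), P279 (subclass of) and P625 (coordinate location), and assumes a subclass map has already been built from the dump. The class IDs in the usage example are made-up placeholders.

```python
from collections import deque


def expand_classes(seed_classes, subclass_of):
    """Return the seed classes plus all of their transitive subclasses.

    subclass_of maps a class ID to the IDs of its direct parents (P279).
    """
    # Invert the map so we can walk from a parent down to its children.
    children = {}
    for child, parents in subclass_of.items():
        for parent in parents:
            children.setdefault(parent, set()).add(child)

    relevant = set(seed_classes)
    queue = deque(seed_classes)
    while queue:
        current = queue.popleft()
        for child in children.get(current, ()):
            if child not in relevant:
                relevant.add(child)
                queue.append(child)
    return relevant


def claim_item_ids(entity, prop):
    """Collect the item IDs referenced by all claims of one property."""
    ids = []
    for claim in entity.get("claims", {}).get(prop, []):
        value = claim.get("mainsnak", {}).get("datavalue", {}).get("value")
        if isinstance(value, dict) and "id" in value:
            ids.append(value["id"])
    return ids


def coordinate(entity):
    """Return (lat, lon) from the first usable P625 claim, or None."""
    for claim in entity.get("claims", {}).get("P625", []):
        value = claim.get("mainsnak", {}).get("datavalue", {}).get("value")
        if isinstance(value, dict) and "latitude" in value and "longitude" in value:
            return value["latitude"], value["longitude"]
    return None


def is_relevant(entity, relevant_classes):
    """Keep entities that have a coordinate and a relevant 'instance of' class."""
    if coordinate(entity) is None:
        return False
    return any(qid in relevant_classes for qid in claim_item_ids(entity, "P31"))


# Usage with placeholder class IDs (not real Wikidata QIDs):
subclass_of = {"Q_olive_sand_beach": ["Q_sand_beach"], "Q_sand_beach": ["Q_beach"]}
relevant = expand_classes({"Q_beach"}, subclass_of)
entity = {
    "id": "Q_example",
    "claims": {
        "P31": [{"mainsnak": {"datavalue": {"value": {"id": "Q_olive_sand_beach"}}}}],
        "P625": [{"mainsnak": {"datavalue": {"value": {"latitude": 19.7, "longitude": -155.0}}}}],
    },
}
keep = is_relevant(entity, relevant)
```

Expanding the class list once up front means each entity only needs a set lookup, which matters when streaming through the whole dump.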

Two things would require some work: category and rating. We map information from sources to an internal category representation. It is binary and fast to filter with bit masks, but not very flexible and probably not that easy to use. For the open source version I was thinking of creating a taxonomy somewhat similar to the one Foursquare uses, but other suggestions are appreciated.
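To illustrate the trade-off, here is a small sketch that contrasts a bit mask representation with a path-style taxonomy. The category names, bit values and the slash-separated path format are invented for this example and are neither our internal categories nor Foursquare's.

```python
from enum import IntFlag


class Category(IntFlag):
    # Hypothetical flags; each category occupies one bit.
    NATURE = 1
    BEACH = 2
    MUSEUM = 4
    AMUSEMENT = 8
    HERITAGE = 16


def matches(poi_mask, wanted):
    """True if the POI has at least one of the wanted category bits set."""
    return bool(poi_mask & wanted)


def is_within(category_path, ancestor_path):
    """True if a taxonomy path equals the ancestor or lies below it."""
    return category_path == ancestor_path or category_path.startswith(ancestor_path + "/")


beach_poi = Category.NATURE | Category.BEACH
bitmask_hit = matches(beach_poi, Category.BEACH | Category.AMUSEMENT)
taxonomy_hit = is_within("nature/beach/sand_beach", "nature/beach")
```

The bit mask check is a single integer operation but is limited to a fixed set of flags, while the path form can grow to arbitrary depth at the cost of string comparisons.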

The rating combines a somewhat objective data quality rating (number of images, links to Wikipedia articles, length of descriptions, types of properties present etc.) with a biased weighting of categories (among other information) that fits our use case. We also use user reviews/ratings, but those wouldn’t be part of the dataset. We could use a slightly more generalized aggregate rating and/or different rating components, but more likely than not you would want to use your own weighting if your use case is sufficiently different, so I guess I am wondering what expectations or requests there are here.
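As a rough idea of what separate rating components could look like, the sketch below computes a quality score from a few counts and then applies a category weight. The caps (5 images, 7 linked articles, 1000 characters, 10 properties), the equal averaging and the example weights are arbitrary assumptions for illustration and are not our actual formula.

```python
def quality_score(image_count, wiki_link_count, description_length, property_count):
    """Average of four components, each capped to the range 0..1."""
    components = [
        min(image_count / 5, 1.0),            # assumed cap: 5 images
        min(wiki_link_count / 7, 1.0),        # assumed cap: 7 linked articles
        min(description_length / 1000, 1.0),  # assumed cap: 1000 characters
        min(property_count / 10, 1.0),        # assumed cap: 10 properties
    ]
    return sum(components) / len(components)


def rating(quality, category, category_weights, default_weight=0.5):
    """Scale the quality score by a use-case specific category weight."""
    return quality * category_weights.get(category, default_weight)


# Usage with made-up weights:
weights = {"museum": 0.8, "beach": 1.0}
score = rating(quality_score(5, 7, 1000, 10), "museum", weights)
```

If the dataset shipped the quality component on its own, consumers could plug in their own `category_weights` without recomputing anything from the raw data.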

Export Formats

TSV and GeoJSON FeatureCollections, but I’m open to suggestions.
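For reference, a minimal sketch of what the two exports could look like, using only the Python standard library. The flat dictionary record and its keys (`name`, `lat`, `lon`, `category`) are assumptions for the example. Note that GeoJSON stores coordinates as longitude first, then latitude.

```python
import csv
import io
import json


def to_feature(poi):
    """Turn one flat POI dict into a GeoJSON Point feature."""
    properties = {k: v for k, v in poi.items() if k not in ("lat", "lon")}
    return {
        "type": "Feature",
        "geometry": {"type": "Point", "coordinates": [poi["lon"], poi["lat"]]},
        "properties": properties,
    }


def to_feature_collection(pois):
    """Wrap all POIs in a GeoJSON FeatureCollection."""
    return {"type": "FeatureCollection", "features": [to_feature(p) for p in pois]}


def to_tsv(pois, columns):
    """Serialize the chosen columns as tab-separated text with a header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(columns)
    for poi in pois:
        writer.writerow([poi.get(column, "") for column in columns])
    return buffer.getvalue()


pois = [{"name": "Brandenburg Gate", "lat": 52.5163, "lon": 13.3777, "category": "monument"}]
geojson_text = json.dumps(to_feature_collection(pois), ensure_ascii=False)
tsv_text = to_tsv(pois, ["name", "lat", "lon", "category"])
```

Nested data such as localizations or links fits naturally into the GeoJSON `properties` object, whereas for TSV it would have to be flattened or split into separate files.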

submitted by /u/berlumptsss