Kima phase one hangover: Georeferencing, Evaluation, Validation and more refining
As mentioned in the previous post, georeferencing our places, which we expected to be a simple script run plus some tweaking of the data, turned out to be a more complex and challenging task, though it came with a lesson and, as a bonus, a way of validating our work.
Approaching the task of georeferencing our places, we first had to admit that our attestation-based toponym corpus contained two very different datasets: the part based on tagged texts, mostly ancient, and the part based on printed book catalogues (hence, 15th century and later). Not only are the sources and methods different; the two datasets represent very different phenomena: toponyms in discourse, and the world of Hebrew print. They should therefore not be used as one database. To avoid anachronistic misidentification, geocoding each dataset also requires a different type of work: ancient toponyms have to be matched against sources like Trismegistos, TIR and Pleiades, while modern ones are better matched against Geonames. We are still in the process of acquiring the ancient datasets for matching.
In the first attempt to match “Kima Modern” against Geonames, we tried to match the English primary forms of our places against the entire Geonames database. A preliminary examination of a sample of the results showed that each form returned many suggested matches. Aiming for more precision, we limited the search to the Geonames feature class P, which contains cities, towns, and other types of settlements (as opposed to other types of geographic features such as regions or rivers). We could do this because the place names in the Modern dataset were all places where books were printed.
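To illustrate the idea of restricting the search to feature class P, here is a minimal sketch in Python. It is not the project's actual script: the tiny in-memory gazetteer, its ids and its rounded coordinates are invented for illustration, and the function names build_index and match_places are hypothetical. It only shows how filtering by feature class narrows the candidates, and that a single name can still return more than one settlement.

```python
from collections import defaultdict

# Invented sample records standing in for a gazetteer such as Geonames.
# Ids and coordinates are placeholders, not real Geonames data.
GAZETTEER = [
    {"id": 1, "name": "Venice", "feature_class": "P", "lat": 45.44, "lon": 12.33},
    {"id": 2, "name": "Venice", "feature_class": "P", "lat": 27.10, "lon": -82.45},
    {"id": 3, "name": "Venice", "feature_class": "A", "lat": 45.50, "lon": 12.40},
    {"id": 4, "name": "Mantua", "feature_class": "P", "lat": 45.16, "lon": 10.79},
    {"id": 5, "name": "Amsterdam", "feature_class": "H", "lat": 52.37, "lon": 4.89},
]


def build_index(records, feature_class="P"):
    """Index records of one feature class by case-folded name."""
    index = defaultdict(list)
    for rec in records:
        if rec["feature_class"] == feature_class:
            index[rec["name"].casefold()].append(rec)
    return index


def match_places(primary_forms, index):
    """Return every candidate record for each primary form (possibly none)."""
    return {form: index.get(form.casefold(), []) for form in primary_forms}


index = build_index(GAZETTEER)
matches = match_places(["Venice", "Mantua", "Amsterdam", "Safed"], index)
for form, candidates in matches.items():
    print(form, [c["id"] for c in candidates])
```

In this toy data, "Venice" still yields two settlements, which is the kind of ambiguity that has to be resolved by other means, while "Amsterdam" yields nothing because its only sample record is not in class P.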
We were still left with many unmatched place names, for which we attempted another method: matching all Hebrew variants against Geonames. In addition to the more than 900 places found with the first method, 200 additional Hebrew variants were matched. This still left us with quite a few unmatched place names, as well as places with multiple candidate matches. We were facing long hours of manual selection and retrieval of coordinates.
Recall of coordinates and multiple results were not, however, our only problems. In the course of perusing the results we realized that many of the automatic matches were wrong. Whether the misidentification originated in our catalogue, in our own failure to refine the data, or in a limitation of Geonames, we realized that we could no longer trust the data as we had hoped to.
The solution presented itself when we retraced our steps: this time we matched all attestations against Geonames while also matching the primary English form, and included in the script a comparison between the Geonames entries returned by the two methods. This time the statistics were better: although 50% of the results were still matched by only one method, we managed to reduce the records that would have to be manually geocoded to 6% of the attestations (15,948 attestations, which correspond to over 1500 variants and just under 600 places). Moreover, we obtained automatic validation for the identification of 44% of our attestations: over 100,000, corresponding to 500 places. These statistics represent, in fact, the level of certainty in our identifications: the unknown, the certain, and the still to be validated.
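The comparison between the two matching methods can be sketched roughly as follows. This is an illustrative Python sketch, not the code in the repository: the function names, the category labels and the sample ids are hypothetical, and treating two conflicting ids as "still to be validated" is our assumption for the example.

```python
from collections import Counter


def classify(id_from_english, id_from_attestation):
    """Classify one attestation by comparing the ids found by each method.

    None means the method found no match.
    """
    if id_from_english is None and id_from_attestation is None:
        return "unmatched"      # needs manual geocoding
    if id_from_english == id_from_attestation:
        return "validated"      # both methods agree
    # Only one method matched, or the two disagree (assumption: both cases
    # are kept for later validation).
    return "to_validate"


def summarize(records):
    """Count attestations per certainty category.

    records: iterable of (attestation, id_from_english, id_from_attestation).
    """
    return Counter(classify(eng, att) for _, eng, att in records)


# Invented sample: attestation label, id via English form, id via attestation.
sample = [
    ("a", 10, 10),
    ("b", 10, None),
    ("c", None, None),
    ("d", 10, 11),
]
counts = summarize(sample)
total = sum(counts.values())
for category, n in counts.items():
    print(f"{category}: {n} ({100 * n / total:.0f}%)")
```

The three categories correspond to the certain, the unknown, and the still to be validated described above.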
And now what?
Our next step (hopefully this coming week) will be to try to match our data again, this time against Wikidata. We will again use the power of cross-checking and parallel matching as validation, not only to compare Geonames and Wikidata, but as a way to validate as much of the data as possible.
We expect to have our data geocoded before the Passover holiday, and we will store it in this folder, where you can all view and download it. Until then, you can find here the yet-to-be-revalidated and yet-to-be-georeferenced data. To follow the code, look in Dimid’s GitHub here.