Linking Syriac Geographic Data Working Group: Place Name Detection and Comparison with Hebrew (Final Report)
From the very beginning, the LinkSyr project has been interested in bringing together original Syriac texts with named entity databases. Our last blogpost detailed how we applied fuzzy matching techniques to the data in order to automatically retrieve place names from newly digitized corpora that carry no annotations. We concluded that more research was needed on two fronts: a manual comparison of the retrieved results against the place names a human reader identifies in a set of texts, and an evaluation of the matching process itself. We have made progress on both.
First of all, the manual control phase has been greatly aided by the additional work of Srećko Koralija, who read through the Syriac Pentateuch (the first five books of the Bible: Genesis, Exodus, Leviticus, Numbers and Deuteronomy) by hand to provide us with a list of place names, including the attested variants of each name. Needless to say, his work provides an important baseline against which we can score the results of our work so far. In time, our collaboration will be able to provide data from the entire Syriac Bible (the Peshitta).
Secondly, we have been able to compare our fuzzy matching approach for Syriac with a Hebrew dataset. At the Linked Pasts meeting in December 2018, we met Sinai Rusinek, who told us about her work on Kima, a previous Pelagios Research Development project with goals similar to our own, but applied to the Hebrew Bible. Using her data, we could perform a cross-lingual test of our algorithms.
The results of this comparison show that we need a better training corpus for Syriac than the one currently provided by the Syriaca gazetteer, which offers ample resources for Syriac places in general, but unfortunately fewer for the oldest period, the one in which our project is interested. We therefore need to find enough data that can be processed algorithmically, so that we can significantly enrich our current database of place names with less manual work.
In the previous blogpost we raised the issue of finding a good threshold: one that retrieves as many geographical names as possible without too large an increase in noise. These two aims pull in opposite directions, so we have to find a trade-off between them. We tested this on the Book of the Laws of the Countries, giving a general list of Syriac place names as input. Putting the threshold for fuzzy matching at 80% yields 84.2% true positives and 15.2% false positives. In other words, of the results which the algorithm indicates as place names, 84.2% are actually place names, and 15.2% are not. We have to lower the threshold to 55% in order to find all place names, including the Egypt example discussed in the previous blogpost. In this case, only 64% of the results are true positives, and 36% are false positives. Lowering the threshold therefore generates more noise, but it does find more place names.
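The trade-off can be illustrated with a minimal sketch. It is not our project code: it assumes Python's standard-library difflib as a stand-in for the fuzzy matcher, and the Latin-script names, the token list and the helper names (similarity, find_candidates, precision, recall) are all invented for the illustration.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Similarity of two strings on a 0-100 scale (difflib stand-in)."""
    return 100.0 * SequenceMatcher(None, a, b).ratio()


def find_candidates(tokens, gazetteer, threshold):
    """Keep tokens whose best score against any gazetteer name meets the threshold."""
    hits = []
    for token in tokens:
        best = max(similarity(token, name) for name in gazetteer)
        if best >= threshold:
            hits.append(token)
    return hits


def precision(hits, gold):
    """Share of retrieved tokens that are real place names."""
    return sum(1 for h in hits if h in gold) / len(hits) if hits else 0.0


def recall(hits, gold):
    """Share of real place names that were retrieved."""
    return sum(1 for g in gold if g in hits) / len(gold) if gold else 0.0


# Hypothetical toy data: a small gazetteer, tokens from a text, and the
# tokens a human reader marked as place names.
gazetteer = ["edessa", "persia", "misrin"]
tokens = ["edesa", "mesren", "person", "law"]
gold = {"edesa", "mesren"}

strict = find_candidates(tokens, gazetteer, 80)  # ['edesa']
loose = find_candidates(tokens, gazetteer, 55)   # ['edesa', 'mesren', 'person']

print(strict, precision(strict, gold), recall(strict, gold))
print(loose, precision(loose, gold), recall(loose, gold))
```

In this toy case the strict threshold returns only correct names but misses one variant, while the loose threshold finds every name at the cost of one false positive, which is the same pattern we see at 80% and 55%.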
The Hebrew data of the Kima project offer a strong contrast: there, a threshold of 85% is enough to retrieve all place names. We found this by randomly splitting the Hebrew data and trying to find the place names of one part in the other. The reason for this better result is that the basic set of place names is much larger, which gives the algorithm a better picture of the type of names to be found and in turn makes the results more reliable. This leads us to conclude that our algorithm works well, but that we need to focus on building a better baseline of Syriac place names, which is still lacking for the moment.
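A split test of this kind could look roughly like the sketch below. It is one possible reading of the procedure, not the code we ran: it again assumes difflib as a stand-in matcher, and the function names, the seed and the toy names in the usage lines are hypothetical.

```python
import random
from difflib import SequenceMatcher


def best_score(name, reference):
    """Highest 0-100 similarity of a name against a reference list."""
    return max(100.0 * SequenceMatcher(None, name, ref).ratio() for ref in reference)


def split_names(names, seed=0):
    """Randomly split a name list into a held-out half and a reference half."""
    shuffled = list(names)
    random.Random(seed).shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


def threshold_for_full_recall(held_out, reference):
    """Highest threshold at which every held-out name still finds a match."""
    return min(best_score(name, reference) for name in held_out)


# Hypothetical usage: a held-out spelling variant scored against a reference list.
needed = threshold_for_full_recall(["edesa"], ["edessa", "persia"])
print(round(needed, 1))  # 90.9

# On a real name list, split first and then score one half against the other.
held_out, reference = split_names(["edessa", "edesa", "persia", "persya"], seed=1)
print(threshold_for_full_recall(held_out, reference))
```

The larger and more varied the reference half, the more likely each held-out name has a close neighbour, which pushes the required threshold up. This matches the intuition that a richer base list makes a stricter, less noisy threshold workable.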
Bringing all this information together, we can summarise our findings throughout the project as follows. Our comparison of Syriac and Hebrew data shows that our method of regular matching and fuzzy matching retrieves good results (as described in the previous blogpost), but that the initial data for our Syriac resources needs to be improved. With the place name data we have at our disposal at the moment (Syriaca, Pentateuch, BLC), we can use our test for a first manual check of a larger Syriac corpus. With little manual work, we will be able to significantly improve our database of place name attestations and discover more variants of the names we already have.