Linking Syriac Geographic Data Working Group: Fuzzy Matching and Data Reconciliation
To reconcile place names efficiently between unannotated text on the one hand and the Syriaca gazetteer on the other, our research group has experimented further with NLP tools.
As in most languages, place names in Syriac can be written in more than one way. Because Syriac is a Semitic language, most of this variation lies in plene versus defective writing, that is, writing with or without matres lectionis, the so-called reading mothers: consonants used to indicate how vowels are to be read. These matres lectionis are not necessarily written, which makes them a source of written variation. A clear example of this variation from the Book of the Laws of the Countries (BLC) is the gentilic for Britain. The standard form is ܒܪܝܛܘܢܝ (brytwny), although a form using additional matres lectionis, ܒܝܪܘܛܘܢܝ (byrwtwny), is also attested. In these words, y and w serve as reading mothers.
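The effect of this variation can be illustrated with a minimal Python sketch. The function name `consonantal_skeleton` and the decision to strip every waw and yudh are our assumptions for illustration only. This is a crude normalisation, since both letters can also be true consonants, and it is not the method used in our reconciliation.

```python
# Waw and yudh, the two matres lectionis involved in the example above.
MATRES = {"\u0718", "\u071D"}  # ܘ (w), ܝ (y)


def consonantal_skeleton(word: str) -> str:
    """Drop every waw and yudh, leaving a rough consonantal skeleton."""
    return "".join(ch for ch in word if ch not in MATRES)


standard = "ܒܪܝܛܘܢܝ"  # brytwny
plene = "ܒܝܪܘܛܘܢܝ"  # byrwtwny

# Both spellings reduce to the same skeleton once w and y are removed.
print(consonantal_skeleton(standard) == consonantal_skeleton(plene))
```

Stripping these letters makes the two spellings comparable, but it also discards real consonants, which is one reason to prefer a tolerant similarity score over hard normalisation.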
Another kind of variation attested in BLC, less obvious from a Syriacist point of view, is the alternation of phonologically close consonants, which yields variants of the common and expected root. An example from our text is a pair of references to Egypt: the place name (ܐܓܦܛܘܣ, ’gptws) and the derived gentilic (ܐܓܒܛܝ, ’gbty). Here the consonants p and b are used interchangeably, which is not common in Syriac.
We wrote a short Python program that turns the Syriaca locations into a workable Python array, which serves as the base data to reconcile against. Using the fuzzywuzzy package, we match our text against the Syriaca entries to see how far we can automatically extract the named entities found in Syriaca from a concrete text. After preliminary testing, we found that a threshold score of 85 (on fuzzywuzzy's scale of 0 to 100) provides a solid baseline for location matching: it leaves enough room for small variation in matres lectionis, as discussed above, and in prepositions, while remaining strict enough not to introduce noise.
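The matching step can be sketched roughly as follows. To keep the example self-contained, it uses `difflib.SequenceMatcher` from the Python standard library as a stand-in for a fuzzywuzzy scorer such as `fuzz.ratio`; the function names, the two-entry gazetteer and the token list are hypothetical and chosen only for illustration, not taken from our actual program.

```python
from difflib import SequenceMatcher

THRESHOLD = 85  # the cut-off reported in the text


def similarity(a: str, b: str) -> float:
    """Similarity of two strings on a 0-100 scale (stand-in scorer)."""
    return 100 * SequenceMatcher(None, a, b).ratio()


def match_tokens(tokens, gazetteer, threshold=THRESHOLD):
    """Return (token, best_entry, score) for every token whose best
    gazetteer match reaches the threshold. Assumes a non-empty gazetteer."""
    hits = []
    for token in tokens:
        best = max(gazetteer, key=lambda entry: similarity(token, entry))
        score = similarity(token, best)
        if score >= threshold:
            hits.append((token, best, round(score)))
    return hits


# Hypothetical miniature gazetteer and token list, for illustration only.
gazetteer = ["ܐܓܦܛܘܣ", "ܒܪܝܛܘܢܝ"]
tokens = ["ܐܓܦܛܘܣ", "ܒܝܪܘܛܘܢܝ", "ܐܓܒܛܝ"]

for token, entry, score in match_tokens(tokens, gazetteer):
    print(token, entry, score)
```

Whether a given variant clears the threshold depends on the scorer and on the length of the word, so the cut-off is worth re-checking whenever either changes.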
In the case of the Book of the Laws of the Countries, we can use the following short bash script to verify to what extent the automatic named entity recognition based on the Syriaca database reflects manual annotation provided by Dirk Bakker in his PhD dissertation.
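As a rough illustration of what such a comparison measures (this is a Python sketch, not the bash script referred to above), the automatically matched tokens can be compared with the manually annotated ones in terms of precision and recall. The function name and the token positions below are hypothetical, and we assume that both the automatic output and the manual annotation can be reduced to sets of token positions.

```python
def precision_recall(predicted, gold):
    """Compare automatically matched items with manually annotated ones.

    precision: share of automatic matches that are also in the annotation
    recall:    share of annotated items that were found automatically
    """
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical token positions, for illustration only.
automatic = {3, 17, 42, 58}
manual = {3, 17, 42, 77, 91}

precision, recall = precision_recall(automatic, manual)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Low precision would point to noise introduced by a threshold that is too lenient, while low recall would point to variants that the threshold rejects.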
We have as yet left other reconciliation issues unaddressed, such as the problem that place names can also have parallel, unrelated meanings. A clear example from the Book of the Laws of the Countries is ܫܝܪ, the standard term for China, which also refers to a Syriac saint. Another example from our text is ܦܘܪܣܝ or ܦܪܣܝ, the gentilic referring to the Persian identity, which is nevertheless also the root of words in the semantic field of laying bare and nakedness.
These problems are not yet pressing as long as we limit ourselves to the text of BLC, but they will become prominent once we apply our methods to a larger corpus. In the LinkSyr project, a parallel effort to develop NLP tools for bringing ever more Syriac resources into the Linked Data cloud, we are working towards an efficient morphological analyzer that will allow us to prepare the data more cleanly for reconciliation with the Syriaca gazetteer. How we resolve these issues will be the subject of our next blog post. Further work will also seek to exploit the inherent hierarchical structure of the Syriaca gazetteer, rather than the flat reconciliation we have focussed on for now.