Almost final report for KIMA
This is an almost final post for what we now refer to as Kima phase 1, the phase that was funded by the Pelagios research grant and was presented here in the two previous posts, the introductory and the intermediate reports. The third and most recent stage of our work on Kima is, as we now realize, very far from being the final stage. Not only will there be more data to add, more ways to explore the data and more functions to develop; there will also be, as will be explained below, constant verification, validation, revision and refinement of the data with each further development.
Since our previous, intermediate report at the end of November, we have presented our work at the Madrid Linked Pasts meeting and then, in mid-January, at the Leipzig conference on Digital Infrastructure for Named Entities Data. We also submitted a poster based on our work to DH2017 in Montreal. In our winter presentations we described our workflow and schema, both of which you can find in our documentation. We also gave examples of two ways to explore our attestation-based historical gazetteer. In the first, a query relates to a specific text, or to a selection or collection of texts, and shows the distribution of the place names used in it while ignoring the specific variants of these names. Look, for example, at these two word clouds:
The figure on the right shows the 100 most frequent place names of the Hebrew Bible, compared to the figure below, with the 100 most frequent place names of the Mishnah, the first text of the Jewish oral tradition, which was redacted around 200 AD. Much has happened in the centuries between these two canonical texts, and this can be seen in the comparison:
While in the Bible the most dominant names (apart from Jerusalem, in second place) are those of the imminently dangerous empires of the ancient East and of the neighboring others (Egypt, Babylon, Assyria, Aram, Gilead, Shomron and Edom), the geographic consciousness of the Mishnah is no longer that of a political entity. After the destruction of the Temple, the loss of sovereignty and the dispersion of the nation, Jerusalem, by then a ruined center, rises to become by far the most prominent place in the discourse. The next entities, Judea, Galilee and Syria, are regional entities that reflect both Roman administration and attempts at reorganizing religious practice in the new geographic reality of the Palestinian province.
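As a rough illustration of this first kind of query (counting place names in a chosen set of texts while ignoring which variant was written), here is a minimal Python sketch. It is not Kima's actual implementation: the record layout, the function name `place_frequencies` and the sample rows are all invented for the example.

```python
from collections import Counter

# Invented sample rows, NOT Kima data.
# Assumed layout: (text, variant as attested, canonical place).
ATTESTATIONS = [
    ("Mishnah", "Yerushalayim", "Jerusalem"),
    ("Mishnah", "Yerushalem", "Jerusalem"),
    ("Mishnah", "Galil", "Galilee"),
    ("Hebrew Bible", "Mitzrayim", "Egypt"),
    ("Hebrew Bible", "Yerushalayim", "Jerusalem"),
]


def place_frequencies(attestations, texts, top_n=100):
    """Count attestations per canonical place within the selected texts.

    Variants are deliberately ignored: every attestation is counted
    under the canonical place it belongs to.
    """
    selected = set(texts)
    counts = Counter(
        place for text, _variant, place in attestations if text in selected
    )
    return counts.most_common(top_n)


if __name__ == "__main__":
    for place, count in place_frequencies(ATTESTATIONS, ["Mishnah"]):
        print(place, count)
```

A list of (place, count) pairs like this is the kind of input a word cloud needs; passing several titles in `texts` would correspond to querying a collection rather than a single text.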
A different kind of query makes use of the diachronic aspect of the corpus and takes the linguistic and orthographic variance as a resource, a feature, rather than a bug to overcome:
In the figure above one can see the variant names used during the 19th century for the city of L’viv, or Lvov, or Lemberg, as they were written in the Hebrew script in the Yiddish (light green) and Hebrew languages, and according to the German (dark green) or Russian (purple) forms. It would be interesting to compare this toponym history with the ones written in the Latin, Gothic or Cyrillic scripts, in the various related languages and cultures, and to see how each of them relates to the others and to the political history of the region.
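A diachronic query of this kind could be sketched roughly as follows. This is again only an assumption-laden illustration: the record layout, the function name `variants_by_decade` and the sample rows (including the years) are invented, and are not taken from the Kima data or schema.

```python
from collections import defaultdict

# Invented sample rows, NOT Kima data.
# Assumed layout: (canonical place, variant, language of the text, year).
RECORDS = [
    ("L'viv", "Lemberg", "Yiddish", 1848),
    ("L'viv", "Lemberg", "Hebrew", 1862),
    ("L'viv", "Lvov", "Hebrew", 1885),
    ("L'viv", "Lvov", "Hebrew", 1889),
    ("Jerusalem", "Yerushalayim", "Hebrew", 1885),
]


def variants_by_decade(records, place):
    """Count (variant, language) pairs for one place, grouped by decade.

    Here the variants are the object of study rather than noise:
    nothing is collapsed except the year, which is rounded down
    to its decade.
    """
    timeline = defaultdict(lambda: defaultdict(int))
    for canonical, variant, language, year in records:
        if canonical == place:
            decade = year // 10 * 10
            timeline[decade][(variant, language)] += 1
    # Return plain dicts, ordered by decade.
    return {decade: dict(counts) for decade, counts in sorted(timeline.items())}


if __name__ == "__main__":
    for decade, counts in variants_by_decade(RECORDS, "L'viv").items():
        print(decade, counts)
```

Grouping by decade is an arbitrary choice made for the sketch; any other time bucket would work the same way.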
We truly expected to conclude the work described below in the two weeks following the Madrid conference, or the Leipzig one. By the end of January, we thought to ourselves, we would have matched the historical places in our gazetteer against Geonames for their coordinates, done some final cleaning of the data, and be finished. Little did we know.
Other projects kept demanding our attention, our time is scarce, and our data is big (hundreds of thousands of attestations) and messier than we expected. Rather than using this to excuse our tardiness in getting to this stage, it is worthwhile describing the complexities we encountered, and you can read about them in our next, really final (though only for this stage) post.