KIMA02: Towards a Sustainable Gazetteer
The first-round Pelagios Resource Development Grant enabled us to launch the Kima project, a Hebrew-script, attestation-based historical gazetteer. The resulting resource was a promising database, but it was still unbalanced and required more work to become usable as an encompassing, multipurpose gazetteer. We were thrilled, then, to hear that our application for the second round was successful.
The second RDG will enable us to consolidate the gazetteer through the digitization, OCR and OCR correction of two large print gazetteers, and through the annotation, using Recogito, of the place names in two bilingual editions of medieval travel narratives. Beyond producing a rich resource in the Hebrew script, it will also enable us to offer a scalable contribution to any gazetteer, and an extension to Recogito, by developing workflows for gazetteer building through Recogito. We will expand here on three aspects of the work: populating the gazetteer, matching and geocoding, and finally, opening and sustaining the gazetteer.
The present Kima collection contains a substantial body of ancient and late-antique attestations with a ‘tail’ of medieval and early modern attestations, as well as a very large, catalogue-based body of early modern and modern attestations. The gap in medieval sources is due partly to historical reasons, and partly to the working procedures and policies of the Academy of the Hebrew Language, one of our main data providers. In the proposed phase of the project we would like to fill in the gaps in order to make Kima a more balanced, representative and multilingual corpus. Hence, the data entry stage of the work comprises the digitization, OCR and OCR correction of two large print gazetteers, and the annotation, using Recogito, of the place names in two bilingual editions of medieval travel narratives.
The bulk of medieval Hebrew-script writing was done in non-Hebrew languages: first in Aramaic, the language of Rabbinic literature, and later in Judeo-Arabic, an Arabic language written in the Hebrew script. Both of these corpora are outside the purview of the Academy of the Hebrew Language. Fortunately, we have two expansive print gazetteers covering the relevant corpora, which can help us complement our corpus and achieve more historical, geographical and linguistic continuity with other Semitic-language gazetteers such as Syriaca and al-Thurayya:
Gottfried Reeg, Die Ortsnamen Israels nach der rabbinischen Literatur. Ludwig Reichert Verlag, Wiesbaden 1983
Entries in the gazetteer include rich bibliographies, discussions on identification and suggestions for localization, as well as coordinates for the TAVO map. For reasons of copyright, only the names, variants and references to the primary sources where they appear will be entered into our database, and the index will be used to align the names with their Hebrew, Greek and transcribed forms.
Moshe Gil, Bemalchut Yishmael Betekufos Hageonim (Hebrew) Tel Aviv University and Bialik Publishing, Jerusalem 1997
The place index to the edition of Genizah manuscripts, in Volume 4, has normalized and vocalized forms leading to a document number. Both will be entered along with the original form as it appears in the (diplomatic) document page.
Another important addition to the corpus would be medieval itineraries. While the Academy of the Hebrew Language has managed to include in its annotated corpus every extant Hebrew text up to the end of the 11th century, the most fascinating geographical literature of the following centuries, including the famous travel narratives of Benjamin of Tudela and Petachia of Regensburg, is yet to be included. Translated editions of these texts are available in the public domain:
The Itinerary of Benjamin of Tudela; Critical Text, Translation and Commentary by Marcus Nathan Adler. London, Oxford University Press 1907
Travels of Rabbi Petachia of Ratisbon; Translated from the Hebrew, and published, together with the Original on opposite pages by Dr. A. Benisch. London, Messrs. Trubner & Co. 1856
The work on the two editions would include scanning, OCR and OCR correction, encoding using Recogito, and basic structural annotation in XML. This would enable later alignment with Christian and Muslim travel and pilgrimage literature, and the development of a more elaborate schema for travel literature at a more advanced stage.
We also intend to use the encoding of these two texts in preparing student assignments for the introductory DH courses that Sinai Rusinek will teach, starting November 2017, at Haifa University and Bar Ilan University.
Matching and Geocoding
Once the Aramaic and Judeo-Arabic additions to the KIMA ancient corpus have been made, we can proceed with the more challenging task of matching the place names gathered in each collection with each other. Building on the lessons from the work on Kima Modern, parallel mappings of the collections to contemporaneous gazetteers will serve as quality assurance for the matching (see the Kima final blog post). This time the matching will be done against historical gazetteers: Pleiades, Trismegistos, TIR and Syriaca. The collaboration with the groups working on and maintaining these gazetteers will include data exchange, which was already partially discussed and agreed on in previous Pelagios meetings in Madrid and Leipzig.
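As an illustration of a first matching pass between two collections, here is a minimal sketch in Python. It assumes a simple normalization (stripping vocalization marks and folding Hebrew final letters to their regular forms) followed by exact comparison. The function names, record IDs and the normalization itself are hypothetical choices for this example, not the Kima code, and real matching would still need manual review.

```python
import unicodedata

# Assumed normalization: fold Hebrew final letters (kaf, mem, nun, pe, tsadi)
# to their regular forms so that word-final spelling does not block a match.
FINAL_FORMS = str.maketrans(
    "\u05da\u05dd\u05df\u05e3\u05e5",
    "\u05db\u05de\u05e0\u05e4\u05e6",
)

def normalize(name: str) -> str:
    """Strip combining marks (vocalization, cantillation) and fold final letters."""
    decomposed = unicodedata.normalize("NFD", name)
    bare = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return bare.translate(FINAL_FORMS).strip()

def match_collections(ours: dict[str, str], theirs: dict[str, str]) -> list[tuple[str, str]]:
    """Pair record IDs from two collections whose normalized names coincide."""
    index: dict[str, list[str]] = {}
    for rec_id, name in theirs.items():
        index.setdefault(normalize(name), []).append(rec_id)
    return [
        (our_id, their_id)
        for our_id, name in ours.items()
        for their_id in index.get(normalize(name), [])
    ]
```

Pairs produced this way are only candidates: two different places may share a name, so each pair would still be checked against the parallel gazetteer mappings described above.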
Opening and Sustaining
While the data at the basis of the gazetteer is already available for download and reuse, and the code is available on GitHub, we aspire to advance the openness and sustainability of KIMA by making it available, on a basic website, as a database that can be queried through an API. This will enable KIMA to become a useful reference tool and application, and the Kima IDs to become persistent URLs for open usage. In addition, the prototype Recogito plugin, which was developed by Dimid in consultation with Rainer Simon, should be further developed into a robust, generic plugin that can support the work of other gazetteers in the future. We propose doing this by adding the following:
A. Recognition of composite place names
B. Fuzzy search options
C. Elaboration of the annotation and validation process to enable export of the results, with statistics on true positives, false positives, and false negatives.
D. Specification of an additional component, which will enable updating the gazetteer with information conforming to our schema.
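To give a sense of what a fuzzy search option (feature B) might look like, the following Python sketch uses the standard library's difflib to offer near matches for a queried form. The function name, the variant-to-ID mapping, the IDs and the cutoff value are assumptions made for the example; the actual plugin may use a different similarity measure.

```python
from difflib import get_close_matches

def fuzzy_lookup(query: str, variants: dict[str, str],
                 cutoff: float = 0.75, limit: int = 5) -> list[tuple[str, str]]:
    """Return (variant form, place ID) pairs whose form is close to the query.

    `variants` maps each known variant form to a place ID (hypothetical layout).
    """
    hits = get_close_matches(query, list(variants), n=limit, cutoff=cutoff)
    return [(form, variants[form]) for form in hits]

# Hypothetical variants for illustration only.
known = {"Ratisbon": "place-1", "Regensburg": "place-1", "Tudela": "place-2"}
suggestions = fuzzy_lookup("Ratisbonn", known)
```

A lower cutoff offers more candidates at the cost of more false positives, which is where the validation statistics of feature C become useful.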
The last two features can be used for continual manual updating of the gazetteer with the texts contributed by our data providers, and can strengthen the connections between Recogito and its hosted gazetteers, so that Recogito also serves as a feeding mechanism for the gazetteers or their underlying databases. The development will be done in concert with the annotation of the two travel narratives, which will serve as use cases for interactive annotation and gazetteer updating. Thus, a workflow will be modeled in which a series of questions is asked about each new tag:
Was this place name identified automatically? If not, was a similar form offered, of which it is a variant? If yes, is the date of the annotated text within the registered date span of the variant? If it was not suggested by the gazetteer, could it be (manually) identified as a place which appears in the gazetteer? These questions will structure a decision tree for adding the annotated forms to the database as an attestation of an existing variant, as a new variant of an existing place, or as a new place in the gazetteer, which will then require additional verification and matching attempts.
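One possible reading of this decision tree can be sketched in Python as follows. The field names, the `Outcome` labels and the handling of the date check (flagging the registered date span for review rather than changing the outcome) are our assumptions for illustration; the workflow that is eventually modeled may order or weigh the questions differently.

```python
from dataclasses import dataclass
from enum import Enum

class Outcome(Enum):
    ATTESTATION = "attestation of an existing variant"
    NEW_VARIANT = "new variant of an existing place"
    NEW_PLACE = "new place (requires verification and matching)"

@dataclass
class Tag:
    auto_identified: bool        # the exact form was suggested by the gazetteer
    similar_form_offered: bool   # a similar form was offered and accepted as related
    manually_identified: bool    # the annotator linked it to an existing place
    date_in_span: bool = True    # text date lies within the variant's registered span

def classify(tag: Tag) -> tuple[Outcome, bool]:
    """Return the outcome and whether the registered date span needs review."""
    if tag.auto_identified:
        # Known form: record an attestation; flag the span if the date falls outside it.
        return Outcome.ATTESTATION, not tag.date_in_span
    if tag.similar_form_offered or tag.manually_identified:
        # Known place, unseen spelling.
        return Outcome.NEW_VARIANT, False
    # Nothing in the gazetteer matches.
    return Outcome.NEW_PLACE, False
```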
The combination of features C and D will also support the development of NER for place names, whether rule-based or through machine learning, and thus the continuous improvement of the plugin.
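The statistics exported by feature C could serve as evaluation measures for such improvement. As a minimal sketch, assuming that automatic suggestions and validated annotations are both available as sets of (start, end) text offsets, precision, recall and F1 can be derived from the counts of true positives, false positives and false negatives. The function name and the data layout are hypothetical.

```python
def evaluate(suggested: set[tuple[int, int]],
             validated: set[tuple[int, int]]) -> dict[str, float]:
    """Compare automatic suggestions with validated annotations."""
    tp = len(suggested & validated)   # suggested and confirmed
    fp = len(suggested - validated)   # suggested but rejected
    fn = len(validated - suggested)   # missed by the automatic step
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"tp": tp, "fp": fp, "fn": fn,
            "precision": precision, "recall": recall, "f1": f1}
```

Tracking these figures across successive versions of the gazetteer would show whether newly added variants actually improve automatic recognition.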