Aggregating KIMA: an intermediate report
With a not-so-fashionable delay, I am now reporting on the second phase of our work. This phase started in October when, in the midst of the deluge of Jewish holidays, our data whiz, Glauco Mantegari, came to Jerusalem to work with us on KIMA, the Hebrew Gazetteer.
Having no center or institution to call our own, we were working in various chosen locations in Jerusalem and Tel Aviv, and the names of the places we frequented, with their stories, echoed our work. It is only appropriate, therefore, to start with a picture of Glauco Mantegari refining data on the slope of the notorious valley of the son of Hinnom, גיא בן הינום, which, over the ages, gave its name to the idea of hell and purgatory: Gehenna. The evening was warm and pleasant, but on the screen, the purgatory of Google Refine was hard at work in a heroic attempt to bring order and meaning to a gigantic mess of historical data.
The intensive week of work in Jerusalem started with meetings with our data providers, in order to discuss the conventions of their data. Each provider had its own ways of formulating temporal information, certainty, accuracy and historicity (e.g. expressions such as “not before”, “around”, “probable” or “cataloger addition”). First was the Academy of the Hebrew Language. The Academy was established in 1953 with the mission “to direct the development of Hebrew in light of its nature”. One of its core projects is the preparation of a Hebrew Historical Dictionary. For this purpose, it created a database of texts of the different historical language strata, where each word in each historical text is analysed for its syntactic and semantic features. The corpus includes texts of all the extant Hebrew compositions from the time of the canonization of the Hebrew Bible until the end of the Geonic period, some Medieval Hebrew texts and large selections of Hebrew literature from the mid-18th century until the founding of the State of Israel.
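The qualifying expressions above could, for instance, be mapped onto a small set of structured flags. The following is a minimal sketch under stated assumptions: the expression list is taken from the examples in the text, but the output keys (`year`, `bound`, `certainty`) are illustrative and not KIMA's actual schema.

```python
# Illustrative mapping of providers' temporal qualifiers to structured flags.
# The output keys are assumptions for this sketch, not KIMA's real schema.
QUALIFIERS = {
    "not before": {"bound": "after", "certainty": "certain"},
    "around": {"bound": "approximate", "certainty": "certain"},
    "probable": {"bound": "exact", "certainty": "probable"},
    "cataloger addition": {"bound": "exact", "certainty": "inferred"},
}

def normalize_date(raw):
    """Split a raw date string like 'around 1552' into a year plus qualifier flags."""
    raw = raw.strip().lower()
    for expr, flags in QUALIFIERS.items():
        if raw.startswith(expr):
            year = raw[len(expr):].strip()
            return {"year": year, **flags}
    # No recognized qualifier: treat the whole string as an exact, certain year.
    return {"year": raw, "bound": "exact", "certainty": "certain"}

print(normalize_date("Around 1552"))
```

Keeping the qualifier separate from the year lets the aggregated records remain comparable across providers without losing each provider's original nuance.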
From this cornucopia of language we received a total of 68,108 textual attestations for 3,678 unique place names, in various forms and spellings. We have complemented these with 6,351 attestations of place names from the Hebrew Bible, which were kindly extracted for us by Dirk Roorda and Martijn Naaijer from SHEBANQ. The biblical place names were collected from a digital version of the Biblia Hebraica Stuttgartensia (BHS), which was made available in the text database of the Hebrew Bible behind SHEBANQ, a system for the study of the Hebrew Bible.
While the data we received from the Academy and from SHEBANQ was extracted from the body of texts, our second main data source is of a different nature altogether: the library catalog. Our provider here is the National Library of Israel, which made available two catalogs and one large thesaurus of authority files, titled “Agron”. The first catalog, the Bibliography of the Hebrew Book, is the fruit of many years of work on a project that documented over 100,000 records of known printed works in Jewish languages, found in collections in Israel and abroad, printed from the time of the early press in the mid-15th century to 1960. The second is the catalog of the National Library itself, which amounts to 300,000 Hebrew records.
What makes these catalog records valuable for a gazetteer is the librarians’ faithful practice of documenting, in a designated field (MARC 260, subfield a), the historical name of the place of publication as it was written at the time of printing, normally on the title page. Conveniently for us, in many cases a normalized form of the place name was entered by the cataloger in a separate field, and the most fortunate cases are those linked to a normalized, authority record place name, such as that kept in the library’s “Agron”.
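The three layers described above (historical form, normalized form, authority link) can be read side by side. Below is a hedged sketch in which a plain dict stands in for a parsed MARC record; the key names and the sample values are assumptions for illustration only.

```python
# A plain dict stands in for a parsed MARC record; key names are illustrative.
def place_of_publication(record):
    """Return the historical form, normalized form and authority link of the place."""
    historical = record.get("260a")          # as printed on the title page
    normalized = record.get("place_norm")    # cataloger's normalized form, if present
    authority = record.get("authority_id")   # link into the Agron, if present
    return historical, normalized, authority

# Hypothetical record: a book printed in Warsaw, place name in Hebrew script.
record = {"260a": "ווארשא", "place_norm": "Warszawa", "authority_id": "agron:12345"}
print(place_of_publication(record))
```

The fortunate cases mentioned in the text are those where all three values are present; the hardest ones carry only the historical form.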
Having learnt from our data providers their various conventions, the next step was conjuring a GeoJSON schema that would enable us to translate the data from the various sources and aggregate it in one structure. The weeks that followed were dedicated to translating the data from the various sources to the schema, adjusting it when needed, and finally, matching and joining them together. This is where the challenges of messy and big data surface, from hidden encoding variations (even within one encoding system!) to our computers protesting and fainting from the hard computational labor. But we will prevail! The first version of the core of our Gazetteer will be at Pelagios soon.
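To make the aggregation step concrete, here is a minimal sketch of the kind of GeoJSON Feature such a schema could produce: one feature per place, with the attestations from the different sources collected under `properties`. The property names (`title`, `names`, `source`) are assumptions for this sketch, not KIMA's published schema.

```python
import json

def make_feature(canonical, lon, lat, attestations):
    """Build a GeoJSON Feature aggregating name attestations for one place."""
    return {
        "type": "Feature",
        "geometry": {"type": "Point", "coordinates": [lon, lat]},
        "properties": {
            "title": canonical,
            # Each attestation keeps its original spelling and its source.
            "names": attestations,
        },
    }

feature = make_feature(
    "Jerusalem", 35.2137, 31.7683,
    [{"name": "ירושלים", "source": "BHS"},
     {"name": "ירושלם", "source": "Academy corpus"}],
)
print(json.dumps(feature, ensure_ascii=False, indent=2))
```

Grouping variant spellings under one feature is what lets attestations from a biblical corpus and a library catalog end up as evidence for the same place.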
CONNECTING TO RECOGITO
While assisting the gazetteer-making process by scripting and parsing, our team’s developer Dimid Duchovny was also working with Pelagios’ Rainer Simon on adapting a plugin for Recogito to enable automatic, as well as manual, markup of Hebrew place names.
To test the plugin, I evaluated a preliminary automatic markup of a medieval text that I had manually marked in advance: the Journeys of Rabbi Petachia of Regensburg. This revealed the predicaments of Hebrew NLP: first and foremost, the lack of written vowels creates multiple ambiguities. The word אולם, for example, may be read as the name of the German city of Ulm, but also as “ulam”, the Hebrew word for “but”. This is a problem that could only be reduced by morphological analysis of each text, or by automatic addition of vowel diacritics.
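The ambiguity can be pictured as a one-to-many lookup: one written form, several candidate readings. The toy lexicon below is made up for the example; real disambiguation would, as noted, require morphological analysis of the surrounding text.

```python
# Toy lexicon: one unvocalized written form maps to several candidate readings.
# Entries are illustrative only; a real system would use a morphological analyzer.
LEXICON = {
    "אולם": [
        {"reading": "Ulm", "type": "place"},          # the German city
        {"reading": "ulam", "type": "conjunction"},   # "but, however"
    ],
}

def candidate_readings(token):
    """Return every known reading of an unvocalized token."""
    return LEXICON.get(token, [])

readings = candidate_readings("אולם")
print([r["type"] for r in readings])
```

An automatic tagger that sees only the written form has no basis for choosing between the two candidates, which is why the manual validation step remains necessary.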
A second interesting predicament, caused by Jewish and Israeli geographical history, is apparent in the markup: several personal names mentioned in the medieval text, such as Amazia, Tuval and Rabbi Petachia’s own name – all traditional, biblical personal names – are detected and marked as place names. Indeed, in the 20th century, newly established kibbutzim and moshavim were named after ancient kings and heroes, thus creating a challenge for linguistic disambiguation. And vice versa: many modern Hebrew personal names are given after place names, whether biblical or not. This is a problem that could be solved in future versions of Recogito, if users were given the choice to select temporal subsets of the gazetteer.
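The temporal-subset idea can be sketched very simply: if each gazetteer entry carried a date of first attestation, the tagger could ignore places that did not yet exist when the text was written. The entries and the `attested` values below are illustrative, not taken from KIMA.

```python
# Illustrative gazetteer entries; the "attested" years are made up for this sketch.
GAZETTEER = [
    {"name": "Amazia", "attested": 1955},       # a modern moshav bearing a biblical name
    {"name": "Regensburg", "attested": 900},    # long predates the medieval text
]

def subset_before(year, entries):
    """Keep only places already attested before the given year."""
    return [e for e in entries if e["attested"] < year]

# For a 12th-century travelogue, the modern moshav drops out of the candidate set.
medieval = subset_before(1200, GAZETTEER)
print([e["name"] for e in medieval])
```

With such a filter, the modern namesakes of biblical kings would simply never be offered as candidates when annotating a medieval text.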
For both of these problems, at this point, we have to rely on manual correction through Recogito’s validation function. With the data openly available, however, we hope it will attract NLP scholars who will take up the challenge of training their named-entity recognition software and applying it to KIMA.