This post originally appeared on the Stoa Consortium blog at http://www.stoa.org/archives/2445.
Participants: Orla Murphy, Sarah Middle, Simona Stoyanova, Núria Garcia Casacuberta
The EDH and Pelagios NER working group was part of the Open Epigraphic Data Unconference held on 15 May 2017. Our aim was to use Named Entity Recognition (NER) on the text of inscriptions from the Epigraphic Database Heidelberg (EDH) to identify placenames, which could then be linked to their equivalent terms in the Pleiades gazetteer and thereby integrated with Pelagios Commons.
Data about each inscription, along with the inscription text itself, is stored in one XML file per inscription. In order to perform NER, we therefore first had to extract the inscription text (contained within <ab></ab> tags) from each XML file, then strip out any markup to leave plain text. There are various Python libraries for processing XML, but most of these turned out to be a bit too complex for what we were trying to do, or simply returned the identifier of the <ab> element rather than the text it contained.
Eventually, we found the Python library Beautiful Soup, which parses an XML document into a navigable structure from which you can select your desired element, then strip out the markup to leave the contents of this element as plain text. It is a very simple and elegant solution, needing only eight lines of code to extract and convert the inscription text from one specific file. The next step is to create a script that will automatically iterate through all files in a particular folder, producing a directory of new files that contain only the plain text of the inscriptions.
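As a rough illustration of the extraction and the folder iteration described above, here is a minimal sketch. It is not the script written at the unconference: to keep it free of extra packages it swaps in the standard library's xml.etree.ElementTree in place of Beautiful Soup, and the function names, the folder layout and the sample XML fragment are all made up for the example.

```python
import xml.etree.ElementTree as ET
from pathlib import Path

# Made-up fragment standing in for one inscription file (not real EDH data).
SAMPLE = (
    '<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body>'
    '<ab>Dis <lb n="2"/>Manibus <w>Romae</w></ab>'
    '</body></text></TEI>'
)


def inscription_text(xml_data):
    """Return the plain text of the first <ab> element, with markup removed."""
    root = ET.fromstring(xml_data)
    for element in root.iter():
        # Compare local names so that a namespaced <ab> also matches.
        if element.tag.rsplit("}", 1)[-1] == "ab":
            # itertext() yields only the text nodes; split/join tidies whitespace.
            return " ".join("".join(element.itertext()).split())
    return ""


def convert_folder(source_dir, target_dir):
    """Write one .txt file per .xml file in source_dir (hypothetical layout)."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    written = []
    for xml_path in sorted(Path(source_dir).glob("*.xml")):
        text = inscription_text(xml_path.read_bytes())
        out_path = target / (xml_path.stem + ".txt")
        out_path.write_text(text, encoding="utf-8")
        written.append(out_path)
    return written


print(inscription_text(SAMPLE))  # Dis Manibus Romae
```

Calling convert_folder("xml", "txt") would then produce a folder of plain text files, one per inscription, ready for NER processing. With Beautiful Soup the core of inscription_text would be shorter still, but the overall shape of the script would be much the same.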
Once we have a plain text file for each inscription, we can begin the process of named entity extraction. We decided to follow the methods and instructions shown in the two Sunoikisis DC classes on Named Entity Extraction.
Here is a short outline of the steps this might involve when it is done in the future.
- Split text into tokens, make a Python list
- Create a baseline
  - cycle through each token of the text
  - if the token starts with a capital letter it’s a named entity (only one type, i.e. Entity)
- Classical Language Toolkit (CLTK)
  - for each token in a text, the tagger checks whether that token is contained within a predefined list of possible named entities
  - Compare to baseline
- Natural Language Toolkit (NLTK)
  - Stanford NER Tagger for Italian works well with Latin
  - Differentiates between different kinds of entities: place, person, organization or none of the above, more granular than CLTK
  - Compare to both baseline and CLTK lists
- Part-Of-Speech (POS) tagging – a precondition for any other advanced operation on a text; gives information on the word class (noun, verb, etc.); tool: TreeTagger
- Chunking – sub-dividing a section of text into phrases and/or meaningful constituents (which may include 1 or more text tokens); export to IOB notation
- Computing entity frequency
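To make the first and last steps of this outline a little more concrete, here is a small sketch of the tokenising, baseline and frequency steps in plain Python. It also includes a simple list lookup in the spirit of the CLTK step, but it does not call CLTK, NLTK or the Stanford tagger. The function names, the sample sentence and the set of known names are all invented for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Split text into a Python list of word tokens, dropping punctuation."""
    return re.findall(r"\w+", text)


def baseline_entities(tokens):
    """Baseline: any token starting with a capital letter counts as an Entity."""
    return [token for token in tokens if token[:1].isupper()]


def lookup_entities(tokens, known_names):
    """List lookup: keep only tokens found in a predefined set of names."""
    return [token for token in tokens if token in known_names]


# Invented sample sentence and name list, for illustration only.
text = "Marcus Iulius vixit Romae et Marcus obiit Romae"
known_names = {"Romae"}

tokens = tokenize(text)
baseline = baseline_entities(tokens)
lookup = lookup_entities(tokens, known_names)

# Compare to baseline: entities the baseline found but the list lookup did not.
only_in_baseline = sorted(set(baseline) - set(lookup))

# Compute entity frequency.
frequency = Counter(baseline)

print(baseline)          # ['Marcus', 'Iulius', 'Romae', 'Marcus', 'Romae']
print(only_in_baseline)  # ['Iulius', 'Marcus']
print(frequency["Romae"])  # 2
```

The capital-letter baseline is deliberately crude: it cannot tell a place from a person. That is why the outline goes on to compare it against list-based and statistical taggers.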
Although we didn’t make as much progress as we would have liked, we have achieved our aim of creating a script to prepare individual files for NER processing, and have therefore laid the groundwork for future developments in this area. We hope to build on this work to successfully apply NER to the inscription texts in the EDH in order to make them more widely accessible to researchers and to facilitate their connection to other, similar resources, like Pelagios.