Data Preparation – the Basics
Getting our initial batch of data from the PELAGIOS partners into the Graph Explorer was easy and a bit of a challenge at the same time. As for the easy part: the two ‘PELAGIOS principles’ of…
- aligning place references with PLEIADES and
- using the OAC vocabulary to express them in RDF
make the ‘baseline’ import almost effortless. We can simply parse the RDF, pick out the OAC annotations, and verify that they point to a valid Pleiades URI – job done. Therefore, if you want to make your own data PELAGIOS-ready, complying with these two principles is really all you need to do. We’ve included some RDF samples in our code repository, which you can use as a reference for the exact RDF syntax. We are also working on a (still unfinished) online application that generates maps from properly formatted data dumps, thereby providing online validation for your data.
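To illustrate the URI-verification step, here is a minimal Java sketch. It only checks that a place reference is a well-formed Pleiades place URI of the canonical `http://pleiades.stoa.org/places/<numeric id>` form; the class and method names are illustrative, and a real importer would of course also resolve the URI against Pleiades itself.

```java
import java.util.regex.Pattern;

// Sketch: syntactic check that an annotation's place reference looks like
// a Pleiades place URI. Resolving the URI to confirm the place actually
// exists is left out here.
public class PleiadesUriCheck {

    private static final Pattern PLEIADES_URI =
        Pattern.compile("https?://pleiades\\.stoa\\.org/places/\\d+/?");

    public static boolean isValid(String uri) {
        return uri != null && PLEIADES_URI.matcher(uri).matches();
    }
}
```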
Structuring the Data
Now on to the advanced part… Once you have produced your OAC-formatted list of PLEIADES URIs, it’s really just that: a long, flat list of places. Already that’s useful for building basic visualizations – such as maps showing a dataset’s geographic extent, or Google-Map-mashups where pushpin-markers link to source texts. But for the Graph Explorer, we wanted to show a more fine-grained picture of the connections within the data.
Usually, a dataset will have some sort of internal hierarchy: a subdivision of an archaeological collection into different sub-collections, perhaps; or a structuring of a text corpus into books, subdivided into volumes, chapters, paragraphs and so on. In terms of the Graph Explorer, this means that when we search for, say, Memphis and Delos, it can tell us that both are mentioned in Herodotus, page 125, rather than giving us the (somewhat less useful) information that both are referenced somewhere in GAP’s Google Books dataset.
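The kind of hierarchy described above can be sketched as a simple tree of labeled dataset units. This is only an illustration of the idea – the names (`DatasetNode`, `addSubset`, `path`) are made up, not the Graph Explorer’s actual classes:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: a dataset unit with a human-readable label and optional
// sub-units, so a search hit can be reported at the finest level
// ("GAP > Herodotus > Page 125") rather than at the dataset root.
public class DatasetNode {

    private final String label;
    private final DatasetNode parent;
    private final List<DatasetNode> subsets = new ArrayList<>();

    public DatasetNode(String label) { this(label, null); }

    private DatasetNode(String label, DatasetNode parent) {
        this.label = label;
        this.parent = parent;
    }

    // Creates and attaches a sub-unit, returning it so calls can be chained.
    public DatasetNode addSubset(String label) {
        DatasetNode child = new DatasetNode(label, this);
        subsets.add(child);
        return child;
    }

    // Full path from the root, e.g. "GAP > Herodotus > Page 125".
    public String path() {
        return parent == null ? label : parent.path() + " > " + label;
    }
}
```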
Unfortunately, the ‘PELAGIOS principles’ don’t at the moment define an explicit mechanism for expressing such structural information. Nonetheless, our partners’ datasets often reflect hierarchy in the design of their resources’ URIs: for example, GAP’s Google Books URIs carry book IDs and page numbers; annotations provided by Perseus include subdivisions into individual chapters, sections, poems, etc.
To make the Graph Explorer’s output more useful, I therefore exploited this implicit information to build the hierarchy in the import scripts. The import scripts also generate human-readable labels for the hierarchical dataset units, based either on consultation with partners (e.g. we simply agreed on how we would name SPQR’s sub-collections and coded that into the import script), or on additional metadata in the data dumps (e.g. GAP includes rdfs:labels in its data dump for exactly this purpose).
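As a rough sketch of what exploiting URI structure might look like: the code below pulls a book ID and page number out of a resource URI and turns them into a label. The `<base>/books/<bookId>/page/<pageNo>` pattern is a made-up stand-in for the real GAP URIs, and the title map mimics labels agreed with partners or taken from rdfs:label metadata:

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch: deriving a human-readable hierarchy label from a resource URI.
// The URI pattern here is hypothetical; adapt the regex to the actual scheme.
public class UriHierarchy {

    private static final Pattern BOOK_PAGE =
        Pattern.compile(".*/books/([^/]+)/page/(\\d+)$");

    // Returns e.g. "Herodotus, page 125", or null if the URI doesn't match.
    public static String label(String uri, Map<String, String> bookTitles) {
        Matcher m = BOOK_PAGE.matcher(uri);
        if (!m.matches()) return null;
        String title = bookTitles.getOrDefault(m.group(1), m.group(1));
        return title + ", page " + m.group(2);
    }
}
```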
In hindsight, it may have made sense to think about an additional ‘principle’ to cover this (e.g. by including an RDF vocabulary like VoID). But then again, at the start of the project the discussion was very much revolving around the groundwork of getting datasets aligned at all, and the Graph Explorer was still just a vague idea. (Not to mention that the sheer diversity of the datasets would make the development of a consistent, reasonably fine-grained description scheme a project in its own right…)
Importing your Data
The bottom line of all this is: getting a hierarchical, custom-labeled dataset into the Graph Explorer will still require some manual tweaking (read: ‘coding effort’) at the moment. With a bit of Java development skill, however, the process should be quite manageable: the essential importer classes are fairly well documented, and there are a number of code examples in the repository.
By the way: we’ve implemented most of the importers in the Java-based scripting language Groovy, which worked really well for us and helped keep the import scripts noticeably shorter than their plain-old-Java equivalents. In particular, I’d recommend taking a look at the GAP and Perseus importer source code to get started.