A place for discussion about anything related to the Recogito annotation platform
Recogito Export Formats
June 14, 2016 at 10:01 am #1430
In a previous discussion the topic of export formats has repeatedly come up, so I thought I’d create a dedicated topic for this. To outline a few basic points:
- The “old” Recogito could export annotation data as Comma-Separated Values (CSV) files. A CSV is a single table with one row of data per annotation, each row consisting of fields such as toponym, gazetteer URI, verification status, geo-coordinates, character offset of the annotation in the text (or pixel coordinates in the image), etc.
- In addition, the old Recogito provided a crude TEI/XML download for text documents. This would give you the annotated text content (rather than just the standoff annotations), in a simple XML template, with places tagged up via TEI placeName tags.
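To make the single-table shape concrete, here is a minimal Python sketch of a CSV export with one row per annotation. The column names and sample values are hypothetical, chosen for illustration only; they are not the old Recogito's actual column headers.

```python
import csv
import io

# Hypothetical column names; the real export may use different ones.
FIELDS = ["toponym", "gazetteer_uri", "status", "lat", "lng", "char_offset"]

# Made-up sample annotations, for illustration only.
annotations = [
    {"toponym": "Athens", "gazetteer_uri": "http://example.org/place/1",
     "status": "VERIFIED", "lat": 37.97, "lng": 23.72, "char_offset": 120},
    {"toponym": "Sparta", "gazetteer_uri": "http://example.org/place/2",
     "status": "NOT_VERIFIED", "lat": 37.08, "lng": 22.43, "char_offset": 245},
]

def to_csv(rows):
    """Serialize annotations as a single table, one row per annotation."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()

csv_text = to_csv(annotations)
print(csv_text)
```

Anything that does not fit a flat column (several candidate places for one toponym, say) would have to be dropped or squeezed into a single cell.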
The question, now, is what export options we want to have in the future.
- I hacked a so-called JSON Lines download into the sandbox at recogito.pelagios.org. IMO this is a very convenient format, particularly once datasets get large. But as far as the content of the current JSON is concerned, do consider it a placeholder only! (For example, the current download doesn’t yet include geo-coordinates.)
- CSV will (very likely) stay as an option, too. However, the internal data model for the annotations has now become quite a bit richer. Any tabular representation will inevitably need to be “dumbed-down” to fit the flat structure of a single table.
- Another option I consider pretty much fixed is RDF, in Open Annotation format.
- GeoJSON has been brought up. I support this idea! It would make mapping the output of Recogito very easy. A bit of an open issue is how we’d encode the annotations. In general, each geographical feature could be connected to multiple (if not many) annotations. GeoJSON doesn’t really put a limit on what “metadata” can be associated with a geographical feature. But most out-of-the-box mapping tools will (to my knowledge) only display properly if the metadata consists of simple key/value pairs. I.e. like for CSV, we’d probably need to dumb down the actual annotation data to make it “mappable”.
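One possible way to do this “dumbing down” for GeoJSON, sketched in Python: read JSON Lines records one at a time, group the annotations by place, and collapse each group into flat key/value properties. The record fields (`place_uri`, `toponym`, `lat`, `lng`) are assumptions for illustration, not the actual schema; the sandbox download does not include coordinates yet, so the sample data is made up.

```python
import json
from collections import OrderedDict

# Made-up JSON Lines input: one JSON object per line, one line per annotation.
# Field names are hypothetical, not Recogito's actual schema.
JSONL = """\
{"place_uri": "http://example.org/place/1", "toponym": "Athens", "lat": 37.97, "lng": 23.72}
{"place_uri": "http://example.org/place/1", "toponym": "Athenai", "lat": 37.97, "lng": 23.72}
{"place_uri": "http://example.org/place/2", "toponym": "Sparta", "lat": 37.08, "lng": 22.43}
"""

def jsonl_to_geojson(lines):
    """Group annotations by place and emit one flat GeoJSON feature per place."""
    places = OrderedDict()
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        record = json.loads(line)
        entry = places.setdefault(
            record["place_uri"],
            {"lat": record["lat"], "lng": record["lng"], "toponyms": []},
        )
        entry["toponyms"].append(record["toponym"])

    features = []
    for uri, entry in places.items():
        features.append({
            "type": "Feature",
            # GeoJSON coordinate order is [longitude, latitude].
            "geometry": {"type": "Point",
                         "coordinates": [entry["lng"], entry["lat"]]},
            # Flattened properties: simple key/value pairs only.
            "properties": {
                "uri": uri,
                "annotation_count": len(entry["toponyms"]),
                "toponyms": "; ".join(entry["toponyms"]),
            },
        })
    return {"type": "FeatureCollection", "features": features}

collection = jsonl_to_geojson(JSONL.splitlines())
print(json.dumps(collection, indent=2))
```

The individual annotations do not survive in this view: each place keeps only a count and a joined string of toponyms, which is the price of staying within simple key/value pairs.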
I’d be interested in any thoughts and “stories” you might have on use cases you are planning, tools you might want to use to post-process Recogito data, how you’d like your annotations to appear on maps, etc.
June 17, 2016 at 11:34 am #1466
I haven’t replied to this until now because it’s not something I have much of a clue about.
Do you have any thoughts about what proportion of the total Recogito content requires a more complex structure (i.e. uncertainties, multiple identities, etc.)? I mean, there’s no point exporting ‘mappable’ formats if it means losing much of the content. You’d get nice-looking visualisations that tell a misleading story. On the other hand, if these uncertainties are few, then perhaps you should go for the mappable formats, and those of us with ‘issues’ can treat the data manually.
Have I missed the point of the question?
June 17, 2016 at 11:45 am #1469
I think that’s a good point. Frankly, I absolutely don’t know what proportion of content (or, in my opinion, users) will have what needs. We’ll have to wait and see, but my guess is we’ll eventually be seeing a larger base of users with relatively simple needs. (Text in, map + a bunch of comments out.) And a number of “power-users” with very specific needs.
As long as we have ways to deal with a sufficient amount of complexity internally, it’s never going to be a problem to offer download formats that just represent a simplified view (with the benefit of requiring less expertise, so that simple things can be done quickly). I fully agree though that we’ll need to find the right balance between “showcasing” those simple data formats on our Download Options page on the one hand, and making their limitations (and dangers) clear to the user on the other.