Category Archives: TEI/XML

Hacking the Early Modern: the EEBO-TCP hackfest

[The original version of this post was first published by ABO Public: An Interactive Forum for Women and the Arts 1640-1830].

So in March, I was invited to my first hack. Me, an English Literature lecturer, was going to have to produce something with computers in one day? Now read on …

Hunched over our laptops in the Weston library

This was the EEBO-TCP hackfest, an event designed to launch the release into the public online domain of over 25,000 texts from the fifteenth to the seventeenth centuries. These texts have been curated and encoded by the Text Creation Partnership, a collaborative project between the University of Michigan, the Bodleian Library at the University of Oxford, and ProQuest, the publishers of the online database Early English Books Online. The idea of the hackfest was that humanities researchers and scholars would come together with digital researchers and technologists to create – in a day – innovative and imaginative ways of exploring, analysing, and developing this huge corpus. Now, while I’ve been tinkering with digital humanities approaches myself, I’m no programmer. Moreover, I’m an eighteenth-centuryist, so I was stepping a little outside my normal safety zone. So it was with some trepidation, yet also with considerable excitement, that I dipped a toe into my first digital hack. The setting was the new Bodleian Weston library: appropriately for a day spent building things, it was still under construction.

It started with a speed-date. Over plenty of coffee, thirty or so of us circulated, telling our stories and plans to anyone we could buttonhole. Given that humanists seemed to be in the majority, most people were looking for a tech person to help out, and in my case, slightly desperately so. My idea was to analyse some of the structural features of pre-eighteenth-century fiction, such as dedications, prefaces, letters to the reader, chapters, and illustrations. But what I didn’t know was how to extract that data from a large corpus and produce something potentially meaningful.

Detail of the XML file of Gabriel de Brémond, Hattige: or The amours of the king of Tamaran. A novel (1683).

I needn’t have worried. Everyone was incredibly receptive and eager to make our plans work, so I found my geek (I know he’s happy with that epithet!): the extraordinarily energetic Dan Q from the Bodleian’s digital team.

Dan Q leaning over my trusty mac

Together with a couple of people working on the formal features of seventeenth-century alchemy texts, we found ourselves a table and began to work out how we might visualize this structural data. And this is the part that I found really exciting: within a couple of hours I had created a sub-corpus of fiction from the total of 25,000 texts, Dan had written some code to identify and count all the structural features I could think of (with some advice from Simon Charles from the TCP project about the TEI markup), and it had started producing some figures. With the knowledge that we all had to present our work at the end of the day, I had to think of ways to set out the results to suggest some kind of point to all this: in short, the ‘so what?’ question. (The crude but quick answer: by putting the texts in chronological order and colour-coding our Excel sheet, a hint of a pattern emerged.)
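Dan’s script itself isn’t reproduced here, but the counting pass is easy to picture. Below is a minimal sketch in Python of how such a count might work, assuming TEI P5 files gathered in a folder and structural features marked as div elements with type attributes; the feature names and folder path are my own hypothetical stand-ins, not the values Dan used.

```python
# Hypothetical sketch of the feature-counting pass -- not Dan's actual code.
import os
from collections import Counter
from lxml import etree

TEI = "{http://www.tei-c.org/ns/1.0}"
FEATURES = {"dedication", "preface", "to_the_reader", "chapter"}  # assumed @type values

def count_features(folder):
    """Tally structural features across every XML file in a folder."""
    totals = Counter()
    for name in os.listdir(folder):
        if not name.endswith(".xml"):
            continue
        tree = etree.parse(os.path.join(folder, name))
        for div in tree.iter(TEI + "div"):
            if div.get("type") in FEATURES:
                totals[div.get("type")] += 1
    return totals

print(count_features("fiction_subcorpus"))  # hypothetical sub-corpus folder
```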

Meanwhile, others in the room were experimenting with identifying the frequency of colour words, tracing the use of Latin, simulating the shelves of the St Paul’s booksellers, and even creating a game based on witch trials (this by Sarah Cole, using Twine), while another team thought about how to make the archive user-friendly for a more diverse audience (see Sjoerd Levelt’s prize-winning entry to the EEBO-TCP Ideas Hack competition). Given that my idea was conceived off the cuff, it was rather splendid to share third prize with our colleagues working at the same table.

What impressed me were the advantages offered by the scale of the corpus and the rigour of its markup. Both of these features of the TCP project enabled Dan and me to produce – with surprising speed – a set of results for a question that would otherwise be much more difficult to answer. But what really blew my mind was how my tech guy took my simple question to another level: Dan wondered ‘how the structural differences between fiction and non-fiction might be usable as a training data set for an artificial intelligence that could learn to differentiate between the two’ (see his own blog post on the event).
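Dan’s suggestion is straightforward to sketch. Assuming feature vectors like the counts above, a toy version in Python with scikit-learn might look like the following; every number here is invented purely for illustration.

```python
# Toy illustration of the fiction/non-fiction training idea -- all data invented.
# Each row counts [dedications, prefaces, letters to the reader, chapters].
from sklearn.linear_model import LogisticRegression

X = [
    [1, 1, 1, 12],  # a novel with full front matter and many chapters
    [0, 1, 0, 0],   # a sermon with a preface and little else
    [1, 0, 1, 8],
    [0, 0, 0, 1],
]
y = [1, 0, 1, 0]  # 1 = fiction, 0 = non-fiction

clf = LogisticRegression().fit(X, y)
print(clf.predict([[1, 1, 0, 10]]))  # structurally novel-like, so likely [1]
```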

‘Nice work Stephen’ ‘Nice work Dan’

I came away a slightly different academic, no longer intimidated by big data, enthused by digital collaboration, and now a big fan of the day-long hack.

Encoding with English Literature undergrads

This is an overview of, and reflection on, a two-hour workshop I ran for English Literature undergraduates introducing XML/TEI. The ‘Encoding worksheet’ (Word doc) is here.

Previously I had taught XML/TEI in one-to-one tutorials, so this was the first time I had tried a group workshop. It comprised two students whom I was supervising (their final-year dissertation projects were digital editions) and two students whose projects concerned print editing (from a module on Early Modern book history run by Prof. Ian Gadd). The knowledge base of these students varied widely: some had no experience of coding or markup; at the other end of the spectrum, one was already competent with HTML. What, then, was the best way into encoding for such a varied cohort?

My answer was to start with the skills they already had (as @TEIConsortium emphasised) and to stress the continuum between digital encoding and the traditional literary-critical analysis students use when preparing any text. After all, we are so frequently concerned with the relationship between form and meaning. And it is the particular capability of XML/TEI to render this relationship between form and meaning that distinguishes it from other kinds of electronic coding.

So the first part of the workshop started with pencil-and-paper tasks. We first annotated a photocopy of a poem. Then I gave them a printout of the transcribed poem stripped of some of its features – title, line spaces, peculiar line breaks, italicisation. I then asked them to annotate, or mark up, this version with a set of instructions to make it look like the ‘original’. The result was that the students not only marked up formal features but clearly had a sense that these features also carried meaning. For example, I asked, “Why was it important to render a line space?” I also pointed out that none of them had inserted the missing title in the plain-text version, which raised some eyebrows: “Is it part of the text?” “Well, how do you define the text?”, I replied. These questions were important for several reasons. I wanted to make the point that markup was a set of editorial and interpretative decisions about what the ‘text’ was, how it might be rendered, and for what purpose. I also wanted to emphasise that both practices – whether pencil notes in the margin or encoding on a screen – involve very similar processes.
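As an aside for anyone wanting to produce the stripped handout electronically rather than by retyping, here is a small sketch of how it might be generated from a TEI encoding; the stanza and its markup are invented for illustration. Note how the title lives in the markup and so naturally drops out, which is precisely the point the students stumbled on.

```python
# Sketch of producing the 'stripped' plain-text handout from a TEI-encoded poem.
# The stanza and its markup are invented for illustration.
from lxml import etree

tei_poem = """<lg xmlns="http://www.tei-c.org/ns/1.0" type="stanza">
  <head>The Title</head>
  <l>A first line, with one word in <hi rend="italic">italics</hi>,</l>
  <l rend="indent">and a second, oddly indented.</l>
</lg>"""

TEI = "{http://www.tei-c.org/ns/1.0}"
root = etree.fromstring(tei_poem)
for line in root.iter(TEI + "l"):
    # itertext() flattens inline markup such as <hi>
    print("".join(line.itertext()))
# The <head> (the title) is never printed -- exactly the students' discovery.
```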

I next wanted to translate these points into an electronic context by illustrating the difference between HTML as, essentially, a markup for how a text looks, and XML as a markup for describing what that text is. I did this using my WordPress editor: by inserting a few HTML tags in the text-editor mode and then switching to the ‘visual’ mode, they could see these features reproduced.[1]

At this point we moved to the computers and got down to some encoding in an XML editor (Oxygen). My main aim here was to enable them to mark up the same poem in an XML editor to see how easily their literary-critical procedure could be transferred to this medium. In this, I was very gratified: all the students were able to create an XML file and mark up the poem remarkably easily.[2] I spent the last section of the workshop answering the implicit question: “You can’t read XML, so what is this for?” Given the restrictions on time, I had to engage only briefly with some very broad issues of digitization and preservation and of analysing big data. Putting it simply, I remarked, “Computers are stupid” (my mantra), “but if we mark up our texts cleverly, we can get computers to look at large bodies of knowledge with precision.” Demonstrating this was tricky given the time restrictions, but I had a go by exemplifying the more complex encoding of meaning possible in XML/TEI. I used a former student’s markup of Defoe’s A Hymn to the Pillory and an XML file of A Journal of the Plague Year. The former demonstrated the encoding of names; for example, I asked, “How would a computer know that ‘S—ll’ is Dr Henry Sacheverell unless you have a way of encoding that?” The Journal was useful for demonstrating the highly structured nature of TEI and our ability to mark up structural features of texts in precise ways: features that a computer can then process.

Detail of the XML file of Defoe’s A Journal of the Plague Year.
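To make the Sacheverell point concrete, here is a hedged sketch of how a program can resolve an obscured name once it is wrapped in a TEI persName element; the fragment and the lookup table are my own inventions, not the student’s actual encoding.

```python
# Sketch of resolving an encoded name -- fragment and lookup table invented,
# not taken from the student's edition of A Hymn to the Pillory.
from lxml import etree

fragment = """<p xmlns="http://www.tei-c.org/ns/1.0">
  The verse takes aim at <persName ref="#sacheverell">S----ll</persName>.
</p>"""

PEOPLE = {"#sacheverell": "Dr Henry Sacheverell"}  # hypothetical personography

root = etree.fromstring(fragment)
for name in root.iter("{http://www.tei-c.org/ns/1.0}persName"):
    print(name.text, "->", PEOPLE.get(name.get("ref"), "unknown"))
# prints: S----ll -> Dr Henry Sacheverell
```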

I also demonstrated the flexibility of TEI: in the editor, typing a new ‘<’ after a closing tag automatically brings up a dropdown list of possible markup elements and attributes. But my key point was that deciding which features to encode – out of all the possible features of a text – was an interpretative and editorial decision.

My aim for the workshop was modest: to enable students to make the leap from so-called ‘traditional’ literary-critical skills to the basics of encoding in XML, and in this I think the session was successful. On reflection, there were two points I hadn’t judged quite right. First, I hadn’t anticipated how quickly they could mark up a poem in XML; I think that was because the transition from pencil annotations to coding on screen worked very well. Second, the last section – on the bigger point of getting computers to read literary texts – turned out to be more important than I had presumed, and I would handle it differently if I were to run this again. That might mean a follow-up session which, given the success of the hands-on first part, would ask students to mark up some more complex textual issues in TEI. This could be combined with a demo showing not only some well-encoded texts but also the results of some data-mining of a medium-sized XML/TEI corpus.

I’ll keep you posted …

[1] There are probably better ways to demonstrate this, given the limitations of the WP text editor, but it was very much to hand.

[2] I acknowledge here my use of teaching materials from the Digital Humanities Oxford Summer School (the very same ones from which I had learnt TEI).