1748: ‘Fiction’ in the Database

Not so long ago I was reviewing a lecture I regularly present to students studying Samuel Richardson’s Clarissa. Looking back, I had no idea that this would lead me to speculate about how bibliographic data relating to English literary history is recorded in electronic databases.

It was a lecture that aimed to give some context about the ‘rise of the novel’ and I always had fun by reminding them just how illegitimate the novel was in the first half of the eighteenth century, and how even literary works formed a tiny proportion of what was published. But this time around, I thought I would actually present them with evidence. Some actual quantities. I came up with the idea of homing in on the year Clarissa was first published: 1748. My first attempt was quick and dirty.


Using the English Short Title Catalogue (ESTC) I typed in the year 1748, left everything else blank, and noted how many publications it returned (2550). I then thought it would be instructive to see how many of these were literary (in the loosest sense), so I went to Eighteenth-Century Collections Online (ECCO) and narrowed the 1748 search down via their category ‘Literature and Language’ (c.250 hits).[1] Now to find how many of those were novels. Both databases yielded results with the subject ‘fiction’ and I then – to ram home the point – narrowed that list down to new titles published that year. Only 0.5% of all works produced in 1748 could be classified as new fiction.[2] In a culture which perceives imaginative writing as practically synonymous with the novel, the result was a gratifying gasp of surprise from my student audience.
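For anyone wanting to repeat the back-of-envelope sums, here is a minimal sketch in Python. It uses only the two counts quoted above. Because those counts come from different databases (ESTC for the total, ECCO for the literary subset), the ratio is rough at best. The number of new fiction titles is not given here, so the last figure is simply what 0.5% of the ESTC total would imply, not a count taken from either catalogue. The variable and function names are my own.

```python
def percentage(part, whole):
    """Return part as a percentage of whole."""
    return 100 * part / whole

estc_total_1748 = 2550      # ESTC hits for the year 1748, as quoted above
ecco_literature_1748 = 250  # ECCO 'Literature and Language' hits (approximate)

# Rough share of the year's output that ECCO files under literature.
literature_share = percentage(ecco_literature_1748, estc_total_1748)

# Back-calculation only: what 0.5% of the ESTC total would amount to.
implied_new_fiction = estc_total_1748 * 0.5 / 100

print(f"Literature and Language: about {literature_share:.1f}% of the year's output")
print(f"0.5% of {estc_total_1748} titles is about {implied_new_fiction:.0f} titles")
```

In other words, the 'gasp' figure corresponds to roughly a dozen titles in a year of some two and a half thousand publications.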

However, this rough-and-ready exercise set me on a different path, and made me think about how these databases, upon which we rely so trustingly, categorise our literary heritage. The simple exercise above revealed clear disparities between these databases in both the numbers and the titles returned, and some odd things about the way ESTC and ECCO had tagged these works. For a detailed breakdown of tags and titles I found, see my spread-sheet here.

A quick bit of history. The page images available in ECCO are digital scans of microfilm photographs of the original physical copies; in other words, as Ben Pauley has pointedly remarked, a remediation of a remediation.[3] The original microfilming was contracted out by the British Library in the mid- to late-twentieth century. These were then purchased and sold to research libraries by a US company called ‘Research Publications’ (I have a sudden flash of memory from my postgraduate days, seeing that name on the microfilm boxes as I painstakingly loaded a film into the reader). In the 1990s that company was then bought by Gale.[4] By 2002 the microfilms had been scanned and ECCO was launched as a commercial database in 2003. A second tranche of material (ECCO Part II) was published in 2009.

ECCO got its bibliographic metadata (for example, details about printers, publishing history, physical description, holding libraries) from the ESTC. However, the ESTC itself has a tangled history. It began life as the Eighteenth-century Short Title Catalogue in 1977. In 1987 it extended its remit to include material from c.1472 to 1700 (incorporating data from the Short Title Catalogue of books printed in England, Scotland, and Ireland, and of English books printed abroad, 1475-1640 and the Wing catalogue which covered the period 1641-1700), and was then renamed the English Short Title Catalogue.[5] Indeed, the precise relationship between ECCO and the palimpsest that is the ESTC is an obscure one, echoing (if you’ll forgive the pun) that between ProQuest’s database Early English Books Online, the ESTC and the Short Title Catalogue, as Bonnie Mak has elegantly pointed out.[6]

When it comes to the question of how subject headings were assigned, there are few hard facts. However, Gale-Cengage gives some clues about this metadata in the FAQs on their website. At some point around 2009, just before the second tranche of digitized texts was published, the MARC (Machine-Readable Cataloging) records for ECCO were ‘enhanced’ by adding Library of Congress (LoC) subject headings.[7] These were obtained from ‘existing’ library records which held the physical copy. However, where this was not possible, ‘ESTC licensed the work of adding LoC headings.’ This process resulted in ‘[o]ver 274,000 subject headings’ being added; Gale notes that these ‘were added through the combination of harvesting and manual assignment.’[8]

It seems there was at least considerable potential for divergence between these two systems of gathering and assigning subject headings, driven as they were by different organisations and groups of people. This might well have led the bibliographers or cataloguers at Gale to adopt a different way of tagging and searching for subject headings.

Returning to the oddities I encountered in preparing my lecture: ESTC enables a search via ‘Subject (genre)’ and ‘Subject;’ ECCO has a drop-down option for ‘Subject.’ However, while ESTC tagged the genre field with ‘novel’ or ‘fiction’ and its subject field ‘fiction’, ECCO tagged the subject fields as ‘fiction’ and/or ‘English fiction’ (note the ‘and/or’ for further confusion). In all, this yields five different sets of results. Moreover, just looking at the widest set of results for the subject heading of ‘fiction’ (including reprints and new editions), the most striking aspect was the far larger number of results returned by the ESTC than by ECCO. There are no instances where ECCO identifies a work as ‘fiction’ that the ESTC does not. Even when ECCO tags A spy on Mother Midnight: or, the Templar Metamorphos’d as ‘fiction’ and the ESTC does not, the ESTC nevertheless tags it as ‘novels.’ However, there are some notable instances where ECCO does not follow the ESTC’s lead.[9] For example, where the ESTC rightly categorises Henry Carey’s Cupid and Hymen: a voyage to the isles of love and matrimony as ‘fiction’ it is not listed as such in ECCO. Even more obviously missing as ‘fiction’ in ECCO is Henry Fielding’s canonical novel The History of the Adventures of Joseph Andrews! Conversely, someone at ECCO must have thought tagging Ovid’s Heroides. English Ovid’s epistles … Translated into English verse as ‘fiction’ – as did the ESTC – was, at best, misleading.
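The comparison described above is, at bottom, a set comparison, and it can be sketched in a few lines of Python. What follows is illustrative only: the three titles and their tags are taken from the examples in the paragraph above, not from an export of either database, and the empty tag set for Cupid and Hymen in ECCO is an assumption (the text only says it is not listed as ‘fiction’). The names `estc_tags`, `ecco_tags` and `tagged` are hypothetical.

```python
# Illustrative sample: tags as described in the prose, not a database export.
estc_tags = {
    "A spy on Mother Midnight": {"novels"},
    "Cupid and Hymen": {"fiction"},
    "Ovid's epistles": {"fiction"},
}
ecco_tags = {
    "A spy on Mother Midnight": {"fiction"},
    "Cupid and Hymen": set(),  # assumption: no 'fiction' tag; other tags unknown
    "Ovid's epistles": {"fiction"},
}

def tagged(catalogue, tag):
    """Return the set of titles carrying the given tag in one catalogue."""
    return {title for title, tags in catalogue.items() if tag in tags}

estc_fiction = tagged(estc_tags, "fiction")
ecco_fiction = tagged(ecco_tags, "fiction")

only_estc = estc_fiction - ecco_fiction  # 'fiction' in ESTC but not in ECCO
only_ecco = ecco_fiction - estc_fiction  # 'fiction' in ECCO but not in ESTC
both = estc_fiction & ecco_fiction       # 'fiction' in both catalogues

print("ESTC only:", sorted(only_estc))
print("ECCO only:", sorted(only_ecco))
print("Both:", sorted(both))
```

Swapping the tag passed to `tagged` (for instance ‘novels’) shows how quickly the result sets shift with the choice of label, which is the point of the exercise.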

Perhaps this goes beyond the issue of the management of data? It is intriguing to speculate on the human intelligence behind the original LoC headings and how they were assigned. Are we talking about individuals who were re-interpreting the nature of the actual texts themselves? How else to account for some of these idiosyncrasies?

Let’s go back to Fielding’s The History of the Adventures of Joseph Andrews (first pub. 1742; 4th ed. 1748) which is tagged by the ESTC as ‘Tobacco-fiction,’ a subject heading that is at least consistent across the ESTC and ECCO. But this is assigned to just three texts in the whole catalogue; the other two are novels by Tobias Smollett: The Adventures of Peregrine Pickle (1751) and The Expedition of Humphry Clinker (1773). Now, it’s true that there are people who smoke in these novels; but there are plenty of other protagonists from the fiction of the period who smoke too and it’s not as if tobacco is a significant plot-device. To take one more example, the anonymous Suite des lettres d’une Peruvienne. Again the subject heading is consistent across the two databases: ‘Epistolary fiction, French-18th century;’ but it is the only title from the entire database that is associated with this subject heading.

More interesting still is what happens to the two variants of Nehemiah How’s A narrative of the captivity of Nehemiah How. For the first on my list (ESTC Number W014008) ECCO seems to agree with its status as fictional, although its ESTC category ‘novels’ has been changed to the less contentious ‘fiction.’ Was someone working for Gale more astute in their reading of eighteenth-century narrative form? Human interpretation in the database is also evident when it comes to the other variant (ESTC number W34168), which looks to have been added later since it appears in ECCO Part II. Notably any tags formally declaring its fictionality have gone: in the ESTC it is replaced with the more precise genre tag of ‘captivity narrative.’ However, ECCO ignores even this slight hint of narrative, opting instead to follow ESTC’s more historical-sounding subject tag of ‘Indian captivities.’

More anomalies could be found (help yourself!) but these few examples are intriguing. How this metadata has been assigned seems to have been the result of a tangled history of cataloguing and bibliography, machines and human agency, and the messy process of translation between academic projects and commercial digital publishing. It’s a warning – just in case we need another – about how we use the metadata available to us via resources like ECCO, EEBO and the ESTC. While these resources are invaluable, careful use also requires knowledge about the historical processes behind their creation. We might also say that human database bibliographers faced the same problems of interpreting and categorising the eighteenth-century novel as literary scholars do, and as critics in the eighteenth century clearly did. So one more thing: it’s easy to forget that behind the search interface on your computer screen, that black box of the database, what we are looking at is evidence everywhere of human intelligence, diligence, error, and above all, interpretation.



Hacking the Early Modern: the EEBO-TCP hackfest

So in March, I was invited to my first hack. Me, an English Literature lecturer, was going to have to produce something with computers in one day? Now read on …

This was the EEBO-TCP hackfest, an event designed to launch the release into the public online domain of over 25,000 texts from the fifteenth to the seventeenth century. These texts have been curated and encoded by the Text Creation Partnership, a collaborative project between the University of Michigan, the Bodleian Library, University of Oxford, and ProQuest, the publishers of the online database Early English Books Online. The idea of the hackfest was that humanities researchers and scholars would come together with digital researchers and technologists and create – in a day – innovative and imaginative ways of exploring, analysing, and developing this huge corpus. Now, while I’ve been tinkering with digital humanities approaches myself, I’m no programmer. Moreover, I’m an eighteenth-century-ist so I was stepping a little outside my normal safety zone. So it was with some trepidation, yet also with considerable excitement, that I dipped a toe into my first digital hack. The setting was the new Bodleian Weston Library: appropriately for a day building things, it was still under construction.

It started with a speed-date. Over plenty of coffee thirty-or-so of us circulated around telling our stories and plans to anyone we could button-hole. Given humanists seemed to be in the majority, most people were looking for a tech person to help out, and in my case, slightly desperately so. My idea was to analyse some of the structural features of pre-eighteenth-century fiction, such as dedications, prefaces, letters to the reader, chapters, illustrations etc. But what I didn’t know was how to bring out that data from a large corpus and produce something potentially meaningful.

Detail of the XML file of Gabriel de Brémond, Hattige: or The amours of the king of Tamaran A novel. 1683.

I needn’t have worried. Everyone was incredibly receptive and eager to make our plans work, so I found my geek (I know he’s happy with that epithet!): the extraordinarily energetic Dan Q from the Bodleian’s digital team.

Dan Q leaning over my trusty mac

Together with a couple of people working with formal features of seventeenth-century alchemy texts, we found ourselves a table and began to work out how we might visualize this structural data. And this is the part that I found really exciting: within a couple of hours I had created a sub-corpus of fiction from the total of 25,000 texts, Dan had written some code to identify and count all the structural features I could think of (with some advice from Simon Charles from the TCP project about the TEI markup), and it had started producing some figures. With the knowledge that we all had to present our work at the end of the day, I had to think of ways to set out the results to suggest some kind of point to all this: in short, the ‘so what?’ question. (The crude but quick answer: by putting the texts in chronological order and colour-coding our Excel sheet, a hint of a pattern emerged.)
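To give a flavour of what counting structural features in TEI-encoded texts might involve, here is a minimal sketch in Python. It is not Dan’s code, which I don’t reproduce here. It assumes that structural divisions are marked as `div` elements (or numbered variants such as `div1`) carrying a `type` attribute, and the particular type values in the sample (‘dedication’, ‘preface’, ‘chapter’) are illustrative guesses rather than a statement of how the TCP files label things. The tiny XML sample and the function name `count_div_types` are invented for the example.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# A made-up miniature document standing in for one TCP text.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text>
    <front>
      <div type="dedication"><p>To a noble patron.</p></div>
      <div type="preface"><p>A word beforehand.</p></div>
    </front>
    <body>
      <div type="chapter"><p>First.</p></div>
      <div type="chapter"><p>Second.</p></div>
    </body>
  </text>
</TEI>"""

def count_div_types(xml_text):
    """Count the 'type' attribute values on div-like elements in one document."""
    root = ET.fromstring(xml_text)
    counts = Counter()
    for element in root.iter():
        # Drop any namespace prefix of the form {uri}tag.
        tag = element.tag.rsplit("}", 1)[-1]
        if tag.startswith("div") and "type" in element.attrib:
            counts[element.attrib["type"]] += 1
    return counts

counts = count_div_types(SAMPLE)
for feature, number in sorted(counts.items()):
    print(feature, number)
```

Run over each text in a sub-corpus, with one row of counts per text sorted by date, this would produce the kind of table we colour-coded in Excel.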

Meanwhile, others in the room were identifying the frequency of colour words, examining the use of Latin, simulating the shelves of the St Paul’s book-sellers, and even creating a game based on witch-trials (this by Sarah Cole, using Twine); another team was thinking about how to make the archive user-friendly to a more diverse audience (see Sjoerd Levelt’s prize entry to the EEBO-TCP Ideas Hack competition). Given my idea was conceived off-the-cuff, it was rather splendid to share third prize with our colleagues working on the same table.

What impressed me were the advantages offered by the scale of the corpus and the rigour of its markup. Both of these features of the TCP project enabled me and Dan to produce – with surprising speed – a set of results for a question that would otherwise be much more difficult to answer. But what really blew my mind was how my tech guy took my simple question to another level: Dan wondered ‘how the structural differences between fiction and non-fiction might be usable as a training data set for an artificial intelligence that could learn to differentiate between the two’ (see his own blog post on the event).

I came away a slightly different academic, no longer intimidated by big data, enthused by digital collaboration, and now a big fan of the day-long hack.

Play, experiment, and digital pedagogy

First of all, a hat-tip to Willard McCarty: during a talk at Bath Spa University in March of this year, he quoted early-twentieth-century English critic I. A. Richards and it was this that crystallised my scattered thoughts on my students’ encounter with digital approaches to English literature. Richards prefaced his book Principles of Literary Criticism with the highly suggestive notion that ‘[a] book is a machine to think with’. Richards’ image was not an idle one: he was an ardent believer in the interplay between the arts and sciences, and for him both his book and the book in the abstract – like any piece of technology from the automated looms of the late eighteenth century onwards – embodied human-designed creative procedures. Through the book, by bringing to bear those same human processes of thought, we are able to examine civilization and what it is to be human: the very task the book was designed to ‘re-weave’.[1] In the digital age it is hard to avoid the resonances: the preeminent machine of our age – the computer – is also governed by human procedures (programming) and ‘processing’ has now become almost entirely associated with computers. Yet we forget that books are, as Richards is implying, an invitation to be (re)processed by humans. What I want to emphasise is that this re-processing – what we less starkly call literary criticism – can be envisioned as a series of procedural building blocks.

What I’m also drawing upon has been defined by Ian Bogost as ‘procedural literacy’. Developing the idea that computing programming is a kind of literacy, Bogost proposed that ‘any activity that encourages active experimentation with basic building blocks in new combinations contributes to procedural literacy.’ Such a literacy in processes and procedures (such as I have described) becomes a foundation that can be applied elsewhere: ‘[e]ngendering true procedural literacy means creating multiple opportunities for learners—children and adults—to understand and experiment with reconfigurations of basic building blocks of all kinds.’[2]

This movement between play, experimentation and a critical awareness of the processes of interpretation was evident during a session on my undergraduate module Digital Literary Studies. Students were introduced to distant reading and invited to work with Voyant Cirrus on eighteenth-century novels. It was apparent in the workshops that the preliminary results of this analysis were not immediately significant or meaningful. So, the next stage involved playing with word choices, selecting synonyms to create clusters of meaning, or choosing antonyms to gain critical leverage. Given these were historical texts, another step involved researching historical inflections using the OED. Some students wanted to add another interpretative layer: using Google’s N-Gram Viewer (with caution) they zoomed out even further. It was interesting to watch. The movement between these steps was not linear: some students moved back into the print copy of the novel for a close reading; some students shuttled back and forth between a few key procedures.
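The step of grouping synonyms and antonyms into clusters can be sketched outside Voyant too. The following is a minimal Python illustration of the idea, not a description of how Voyant works internally: the passage is an invented sentence, the cluster names and word lists are placeholders a student might choose, and the tokenisation is deliberately crude.

```python
import re
from collections import Counter

def tokenize(text):
    """Lower-case the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def cluster_counts(text, clusters):
    """Sum the frequencies of every word in each named cluster."""
    frequencies = Counter(tokenize(text))
    return {
        name: sum(frequencies[word] for word in words)
        for name, words in clusters.items()
    }

# Invented sample sentence; in practice this would be the text of a novel.
passage = "Virtue is its own reward; yet honour and virtue alike may yield to vice."

# Hypothetical clusters: a word, a near-synonym, and an opposing pair.
clusters = {
    "virtue": ["virtue", "honour"],
    "vice": ["vice", "sin"],
}

result = cluster_counts(passage, clusters)
print(result)
```

Changing the word lists and re-running is the code equivalent of the shuttling back and forth described above: each new cluster is a small hypothesis about the text.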

The initial surprise that textual visualization did not produce an immediate interpretation was a useful warning about the technological lure of instant answers. Instead, results became merely a first step in a series of experiments: each set of word choices – let’s call them hypotheses – required us to re-think the interpretative assumptions about the text(s). Moreover, the significance of the results was also subject to constant discussion, as if the text itself was changing shape. What my students discovered via this experimentation is the fascinating tension between different processes of interpretation: between what I. A. Richards might call re-weaving and what Lisa Samuels and Jerome McGann termed ‘deformance.’[3] The aim of the session was to generate some analyses of the literary history of the novel between 1660 and 1799; but the session also enabled students to slow down and reflect on their processes of interpretation: it trained them to be procedurally literate.

I started by citing I. A. Richards, part of a group of critics and intellectuals who in the early twentieth century placed close reading at the heart of English Studies. Despite its varied fortunes it is still there. What is most resonant for me and my students is the interplay between close reading, digital reading and procedural literacy. Experimentation puts both students and tutor at the very edge of their knowledge, but it is a place that is productively challenging. In helping students to see their learning as a series of processes that can be modified and reiterated, we are also equipping them with a critical and creative self-awareness that fits them for the rapidly changing twenty-first century world.

[1] I. A. Richards, Principles of Literary Criticism. 3rd ed. London: Kegan Paul, 1926, vii.

[2] Ian Bogost, ‘Procedural Literacy: Problem Solving with Programming, Systems, & Play.’ , 52:1&2 (Winter/Spring, 2005), 32-36.

[3] Lisa Samuels and Jerome McGann, ‘Deformance and Interpretation.’ New Literary History 30:1 (1999), 25-56.


What is a novel in the eighteenth century? Some numbers …

Some of my undergraduates playing with data…

Digital Literary Studies

Students Ben Franks and Alice Creswell share their charts on some keyword searches conducted via the ‘Genre’ filter in ESTC across 1660-1799. The first chart breaks down the 2,880 hits from the genre term ‘Fiction’ into various title keywords:

Fiction

This second pie-chart breaks down the 1,434 hits from the search term ‘Novels’:

Novels

We wondered about the ways in which the ESTC catalogue had tagged these genres and the extent to which they overlapped (meta-metadata questions?). But these results were given additional context and meaning by setting them against the same keyword searches on Google’s N-Gram Viewer and some more granular searches of the metadata of the 1,000 novels in the Early Novels Database.

Ben and Alice’s favourite titles? The Devil Turn’d Hermit (check that full title!) and Adventures of a Bank-Note.

Fun with Google’s N-Gram Viewer for my C18th students

Just the other day I was preparing to teach Sterne’s A Sentimental Journey (1768). Usually, I ask students to try to historicise the meanings of the word ‘sentimental’, in effect placing it within the broader culture of sensibility. This year, wondering how I might emphasise how new and even fashionable the word sentimental became in the latter half of the century, I thought of Google’s N-Gram Viewer. I’d seen this in action in relation to the eighteenth century on the Persistent Enlightenment blog. So I thought I’d give it a go:


There’s a gratifyingly significant rise from around the 1750s (and a small dip around the 1790s when sentimentalism was perceived in Britain to be associated with the radical levelling tendencies of the French Revolution). Of course this does not give us insight into the meanings of the word, but Google also offers links to the word’s place in the source material so that I hope my students can look at the word in context. It’s also useful when used alongside title searches on ESTC.

On another note, and since my students also engage with eighteenth-century contextual material from ECCO, I’ve often warned that the practice of capitalising certain words did not necessarily indicate particular significance, and that this was more often a printer’s convention for certain nouns that gradually died away towards the end of the century. The N-Gram Viewer is case-sensitive, so to search for different cases I clicked ‘case-insensitive’ and searched for ‘virtue’ between 1700 and 1799:


It’s great to see that cross-over so clearly. The idea of virtue wasn’t going out of fashion, but the fashion for capitalising it was.