Category Archives: Databases

1748: ‘Fiction’ in the Database

Not so long ago I was reviewing a lecture I regularly present to students studying Samuel Richardson’s Clarissa. Looking back, I had no idea that this would lead me to speculate about how bibliographic data relating to English literary history is recorded in electronic databases.

It was a lecture that aimed to give some context about the ‘rise of the novel’, and I always had fun reminding them just how illegitimate the novel was in the first half of the eighteenth century, and how even literary works formed a tiny proportion of what was published. But this time around, I thought I would actually present them with evidence. Some actual quantities. I came up with the idea of homing in on the year Clarissa was first published: 1748. My first attempt was quick and dirty.

‘Gasp!’

Using the English Short Title Catalogue (ESTC) I typed in the year 1748, left everything else blank, and noted how many publications it returned (2550). I then thought it would be instructive to see how many of these were literary (in the loosest sense), so I went to Eighteenth-Century Collections Online (ECCO) and narrowed the 1748 search down via their category ‘Literature and Language’ (c.250 hits).[1] Now to find how many of those were novels. Both databases yielded results with the subject ‘fiction’ and I then – to ram home the point – narrowed that list down to new titles published that year. Only 0.5% of all works produced in 1748 could be classified as new fiction.[2] In a culture which perceives imaginative writing as practically synonymous with the novel, the result was a gratifying gasp of surprise from my student audience.
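For anyone who wants to reproduce the back-of-the-envelope arithmetic behind that slide, here is a minimal sketch in Python. The ESTC total is the 2,550 noted above; the count of new fiction titles is an illustrative assumption (around a dozen), chosen only to be consistent with the roughly 0.5% figure, not a number taken from my spreadsheet.

```python
# Back-of-the-envelope arithmetic for the lecture slide.
# estc_total_1748 comes from the ESTC search described above;
# new_fiction_1748 is an assumed, illustrative figure.
estc_total_1748 = 2550
new_fiction_1748 = 13

share = new_fiction_1748 / estc_total_1748
print(f"New fiction as a share of all 1748 publications: {share:.1%}")
# prints roughly 0.5%
```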

However, this rough-and-ready exercise set me on a different path, and made me think about how these databases, upon which we rely so trustingly, categorise our literary heritage. The simple exercise above revealed clear disparities between these databases in both the numbers and the titles returned, and some odd things about the way the ESTC and ECCO had tagged these works. For a detailed breakdown of the tags and titles I found, see my spreadsheet here.

A quick bit of history. The page images available in ECCO are digital scans of microfilm photographs of the original physical copies; in other words, as Ben Pauley has pointedly remarked, a remediation of a remediation.[3] The original microfilming was contracted out by the British Library in the mid- to late-twentieth century. The films were then purchased and sold on to research libraries by a US company called ‘Research Publications’ (I have a sudden flash of memory from my postgraduate days, seeing that name on the microfilm boxes as I painstakingly loaded a film into the reader). In the 1990s that company was bought by Gale.[4] The microfilms had been scanned by 2002, and ECCO was launched as a commercial database in 2003. A second tranche of material (ECCO Part II) was published in 2009.

ECCO got its bibliographic metadata (for example, details about printers, publishing history, physical description, holding libraries) from the ESTC. However, the ESTC itself has a tangled history. It began life as the Eighteenth Century Short Title Catalogue in 1977. In 1987 it extended its remit to include material from c.1472 to 1700 (incorporating data from the Short-Title Catalogue of Books Printed in England, Scotland, and Ireland, and of English Books Printed Abroad, 1475–1640 and the Wing catalogue, which covered the period 1641–1700), and was then renamed the English Short Title Catalogue.[5] Indeed, the precise relationship between ECCO and the palimpsest that is the ESTC is an obscure one, echoing (if you’ll forgive the pun) that between ProQuest’s database Early English Books Online, the ESTC and the Short Title Catalogue, as Bonnie Mak has elegantly pointed out.[6]

When it comes to the question of how subject headings were assigned, there are few hard facts. However, Gale-Cengage gives some clues about this metadata in the FAQs on its website. At some point around 2009, just before the second tranche of digitized texts was published, the MARC (Machine-Readable Cataloguing) records for ECCO were ‘enhanced’ by adding Library of Congress (LoC) subject headings.[7] These were obtained from the ‘existing’ records of libraries which held the physical copy. However, where this was not possible, ‘ESTC licensed the work of adding LoC headings.’ This process resulted in ‘[o]ver 274,000 subject headings’ being added; Gale notes that these ‘were added through the combination of harvesting and manual assignment.’[8]
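Gale does not describe its tooling, but for readers curious about what ‘harvesting’ subject headings from existing library records might look like in practice, here is a minimal, purely illustrative sketch using the Python pymarc library. The filename is hypothetical, and field 650 is used only because it is the MARC 21 tag for topical (LoC-style) subject headings.

```python
from collections import Counter
from pymarc import MARCReader  # third-party library for reading MARC records

# Illustrative sketch only: tally the subject headings found in a file of
# MARC records. 'records.mrc' is a hypothetical filename; field 650 holds
# topical subject headings in MARC 21.
subject_counts = Counter()

with open("records.mrc", "rb") as fh:
    for record in MARCReader(fh):
        for field in record.get_fields("650"):
            heading = " -- ".join(field.get_subfields("a", "x", "y", "z"))
            if heading:
                subject_counts[heading] += 1

for heading, n in subject_counts.most_common(20):
    print(f"{n:5d}  {heading}")
```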

It seems there was, at the very least, considerable potential for divergence between these two systems of gathering and assigning subject headings, driven as they were by different organisations and groups of people. This might well have led the bibliographers or cataloguers at Gale to adopt a different way of tagging and searching for subject headings.

Returning to the oddities I encountered in preparing my lecture: the ESTC enables a search via ‘Subject (genre)’ and ‘Subject’; ECCO has a drop-down option for ‘Subject.’ However, while the ESTC tagged the genre field with ‘novel’ or ‘fiction’ and its subject field with ‘fiction’, ECCO tagged the subject fields as ‘fiction’ and/or ‘English fiction’ (note the ‘and/or’ for further confusion). In all, this yields five different sets of results. Moreover, looking at the widest set of results for the subject heading of ‘fiction’ (including reprints and new editions), the most striking aspect was the far larger number of results returned by the ESTC than by ECCO. There are no instances where ECCO identifies a work as fiction that the ESTC does not also flag in some way: even when ECCO tags A spy on Mother Midnight: or, the Templar Metamorphos’d as ‘fiction’ and the ESTC’s subject field does not, the ESTC nevertheless tags it under the genre ‘novels.’ However, there are some notable instances where ECCO does not follow the ESTC’s lead.[9] For example, where the ESTC rightly categorises Henry Carey’s Cupid and Hymen: a voyage to the isles of love and matrimony as ‘fiction’, it is not listed as such in ECCO. Even more obviously missing as ‘fiction’ in ECCO is Henry Fielding’s canonical novel The History of the Adventures of Joseph Andrews! Conversely, someone at ECCO must have thought that tagging Ovid’s Heroides. English Ovid’s epistles … Translated into English verse as ‘fiction’ – as the ESTC does – was, at best, misleading.
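Once the two sets of results are exported as title lists, finding these divergences is just set arithmetic. A minimal sketch, with hypothetical placeholder titles standing in for the real exports:

```python
# Illustrative only: compare which titles each database tags as 'fiction'.
# These placeholder sets stand in for the exported ESTC and ECCO result lists.
estc_fiction = {"Title A", "Title B", "Title C"}
ecco_fiction = {"Title B"}

print("Tagged in ESTC but not ECCO:", sorted(estc_fiction - ecco_fiction))
print("Tagged in ECCO but not ESTC:", sorted(ecco_fiction - estc_fiction))
```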

Perhaps this goes beyond the issue of the management of data? It is intriguing to speculate on the human intelligence behind the original LoC headings and how they were assigned. Are we talking about individuals who were re-interpreting the nature of the actual texts themselves? How else to account for some of these idiosyncrasies?

Let’s go back to Fielding’s The History of the Adventures of Joseph Andrews (first pub. 1742; 4th ed. 1748), which is tagged by the ESTC as ‘Tobacco-fiction,’ a subject heading that is at least consistent across the ESTC and ECCO. But it is assigned to just three texts in the whole catalogue; the other two are novels by Tobias Smollett: The Adventures of Peregrine Pickle (1751) and The Expedition of Humphry Clinker (1773). Now, it’s true that there are people who smoke in these novels; but plenty of other protagonists in the fiction of the period smoke too, and it’s not as if tobacco is a significant plot device. To take one more example: the anonymous Suite des lettres d’une Peruvienne. Again the subject heading is consistent across the two databases – ‘Epistolary fiction, French-18th century’ – but it is the only title in the entire database associated with this subject heading.

More interesting still is what happens to the two variants of Nehemiah How’s A narrative of the captivity of Nehemiah How. For the first on my list (ESTC Number W014008), ECCO seems to agree with its status as fictional, although its ESTC category ‘novels’ has been changed to the less contentious ‘fiction.’ Was someone working for Gale more astute in their reading of eighteenth-century narrative form? Human interpretation in the database is also evident when it comes to the other variant (ESTC number W34168), which looks to have been added later since it appears in ECCO Part II. Notably, any tags formally declaring its fictionality have gone: in the ESTC they are replaced with the more precise genre tag of ‘captivity narrative.’ In ECCO, however, even this slight hint of narrative is ignored, and the record instead follows the ESTC’s more historical-sounding subject tag of ‘Indian captivities.’

More anomalies could be found (help yourself!), but these few examples are intriguing. How this metadata has been assigned seems to have been the result of a tangled history of cataloguing and bibliography, of machines and human agency, and of the messy process of translation between academic projects and commercial digital publishing. It’s a warning – just in case we need another – about how we use the metadata available to us via resources like ECCO, EEBO and the ESTC. These resources are invaluable, but careful use requires knowledge of the historical processes behind their creation. We might also say that the bibliographers who built these databases faced the same problems of interpreting and categorising the eighteenth-century novel as literary scholars do, and as critics in the eighteenth century clearly did. So one more thing: it’s easy to forget that behind the search interface on your computer screen, that black box of the database, what we are looking at is evidence everywhere of human intelligence, diligence, error and, above all, interpretation.

[1] Characteristically, ECCO returns slightly different numbers even when the same search is repeated. See Joseph Dane, What is a Book? The Study of Early Printed Books (University of Notre Dame Press, 2012), pp.224-7.

[2] In this essay I make no claim to a comprehensive list of fiction published in 1748, or even to define what fiction is or was. Jerry Beasley’s A Check List of Prose Fiction Published in England 1740–1749 (University Press of Virginia, 1972), for example, might be a good place to start. But would we want to include, say, chapbooks as fiction? Quite possibly, but neither Beasley’s checklist nor ECCO includes them, and the ESTC’s coverage of this genre is unclear.

[3] Thanks to Ben Pauley; also to Scott Gibbons, Giles Bergel, and Elizabeth Grumbach for helpful conversations.

[4] See Laura Mandell, ‘The Business of Digital Humanities: Capitalism and Enlightenment’, Scholarly and Research Communication, 6.4 (2015). http://www.src-online.ca

[5] http://www.bl.uk/reshelp/findhelprestype/catblhold/estchistory/estchistory.html

[6] Bonnie Mak, ‘Archeology of a Digitization.’ Pre-print, pp.10-11. http://illinois.edu/ds/search?search_type=userid&search=bmak

[7] For the Library of Congress subject headings and genre terms see http://loc.gov/aba/cataloging/subject/

[8] http://gdc.gale.com/products/eighteenth-century-collections-online/acquire/faqs/#marc-enhance

[9] As well as a number of texts which do not exist in ECCO at all.


What is a novel in the eighteenth century? Some numbers …

Some of my undergraduates playing with data…

Digital Literary Studies

Students Ben Franks and Alice Creswell share their charts on some keyword searches conducted via the ‘Genre’ filter in the ESTC across 1660–1799. The first chart breaks down the 2,880 hits from the genre term ‘Fiction’ into various title keywords:

[Pie chart: ‘Fiction’]

This second pie-chart breaks down the 1,434 hits from the search term ‘Novels’:

[Pie chart: ‘Novels’]

We wondered about the ways in which the ESTC catalogue had tagged these genres and the extent to which they overlapped (meta-metadata questions?). But these results were given additional context and meaning by setting them against the same keyword searches on Google’s Ngram Viewer and some more granular searches of the metadata of the 1,000 novels in the Early Novels Database.
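For readers wondering how such a breakdown is produced: given a list of titles exported from an ESTC genre search, it is essentially keyword counting. A minimal sketch in Python, with a hypothetical keyword list (not necessarily the one Ben and Alice used) and a few sample titles:

```python
from collections import Counter

# Count how many exported titles contain each keyword.
# The keyword list and sample titles here are illustrative only.
keywords = ["history", "adventures", "life", "letters", "memoirs"]

def keyword_breakdown(titles):
    counts = Counter()
    for title in titles:
        lowered = title.lower()
        for kw in keywords:
            if kw in lowered:
                counts[kw] += 1
    return counts

sample_titles = [
    "The history of Tom Jones, a foundling",
    "The adventures of Roderick Random",
    "Clarissa. Or, the history of a young lady",
]
print(keyword_breakdown(sample_titles))
# Counter({'history': 2, 'adventures': 1})
```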

Ben and Alice’s favourite titles? The Devil Turn’d Hermit (check that full title!) and Adventures of a Bank-Note.


How a database works: some thoughts on a student task

Here’s some out-loud thinking about a session for my new module, Digital Literary Studies. Since the module will require students to work with a wide range of online resources, I really wanted to make sure they could begin to understand how those resources work. Moreover, the module – via eighteenth-century literature – will be thinking about categorisation and representation, so I wanted to build a set of tasks that would introduce these issues. Below is a draft of what I might give to my students. (Acknowledgement: this is an adaptation of a student task devised by George Williams, who kindly shared it with me in a pub near the British Library.) I’ll aim to write a post on how it goes.

Throughout this module we’re going to be working with a variety of online databases and resources, so the aim of this session is to get an idea of what happens behind the scenes (a.k.a. the ‘interface’): it’s really about how data is ordered and managed so it can be searched. Before this session, you might find it helpful to look at the online databases and catalogues you already use and see how you can search them (e.g. JSTOR or the BSU library catalogue).

  1. I’ve given you a number of music CDs: select two each. For each individual CD assign a sheet of paper and write down a list of information about it, beginning with the obvious categories of artist/group name and title of CD. Then move on to other categories of information: at this point I’ll leave these up to you (and no conferring at this point – you’ll see why later).
  2. Congratulations, you’ve built a database! Let’s try some searches and see what happens.
  3. Now get together and compare your categories. For each category assign a sheet of paper and list all the relevant data for that category (i.e. one sheet will have all the artists/group names; another sheet will have all the titles; and so on for each category). Well done, you’ve now built what’s called a ‘relational database’ (see the sketch after this task for what the same idea looks like in software).
  4. To what extent did you each order data differently? Was some information difficult to represent or categorise? How did you solve these differences and difficulties?
  5. At this point, we’ll try some more searches using your data and see what comes up and, perhaps, what is missing.

To conclude, we’ll compare our database with something like the English Short Title Catalogue and Eighteenth-Century Collections Online. You’ll note that we’ve built a database that describes objects but does not actually give us the object itself: in many cases this is called ‘metadata’. (In a different context, the electronic surveillance programmes run by the NSA and GCHQ have been described as the analysis of metadata: for a revealing view on such ‘data-mining’ see this fascinating piece of research by MIT researcher Ethan Zuckerman.)
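For a concrete sense of what the paper exercise models (the sketch promised in step 3 above), here is a minimal version of the same relational idea using Python’s built-in sqlite3 module. Table and column names echo the categories in the task; the sample rows are invented.

```python
import sqlite3

# A minimal relational database for the CD exercise: one table per
# 'sheet of paper', linked by an id. Sample data is invented.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE artists (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE cds (
        id INTEGER PRIMARY KEY,
        title TEXT,
        year INTEGER,
        artist_id INTEGER REFERENCES artists(id)
    );
""")
con.execute("INSERT INTO artists VALUES (1, 'Example Artist')")
con.execute("INSERT INTO cds VALUES (1, 'Example Album', 1997, 1)")

# A search that joins the two 'sheets' back together:
for row in con.execute("""
    SELECT artists.name, cds.title, cds.year
    FROM cds JOIN artists ON cds.artist_id = artists.id
    WHERE artists.name LIKE '%Example%'
"""):
    print(row)
```

Note that, just like the paper version, this database only describes the CDs: the records are metadata, not the music itself.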

The Digital Miscellanies Index at BSECS 2013

I’ve been following the work of the team behind the Digital Miscellanies Index (hereafter DMI) for the last year and a half, but at this year’s annual meeting of BSECS I had the chance to attend a panel given by the team on some of their latest findings, and also to test an early version of the database.

The roundtable panel ‘Compiling the Canon: what can poetic miscellanies tell us? New findings from the Digital Miscellanies Index’ comprised Jennifer Batt, Rosamund Powell, Adam Bridgen and Mark Burden. Jenny Batt – the project’s coordinator – announced the startling fact that the DMI has indexed approximately 1,400 miscellanies from the period. Her own piece exemplified how one would use the DMI by focusing upon Mary Leapor’s poems in various miscellanies: mapping their chronological spread, the sources of the poems, and their destinations. For example, the largest number of her poems in the miscellanies came from her first volume of poetry, Poems upon Several Occasions (1748). However, her poems also appeared anonymously in some miscellanies, so the DMI also challenges traditional authorship-centric notions of poetic dissemination, or what Jenny called ‘authorial branding’. Ros Powell’s piece on the mentions of Horace’s Art of Poetry in miscellanies revealed the flexibility of the DMI: she was able to separate mere mentions of or quotations from Horace, translations of Horace, and imitations – whether attributed or unattributed. She was also able to break these varying uses of Horace down into percentages (some nice pie charts too, which I never thought I’d find myself saying in a literary context!). Adam Bridgen fascinatingly concentrated on a surprising and little-studied genre of poem to be found in the miscellanies – the last will and testament. Adam pointed out that the well-known literary genre of the ‘mock testament’ afforded much satiric potential, especially when wielded by Pope and Swift, but what he found in the miscellanies were frequently real wills and testaments rendered in poetic form. The disjunction between form and function in such poems did not necessarily undermine their moral or functional role, although, as Adam could not help pointing out, it could go awry and create unintended comic consequences. The final piece, by Mark Burden, concerned the reconstruction of the reading done in dissenting academies and looked to the DMI to aid such research by asking what poetry was being read in the academies. Since I’m a big Defoe fan, I’ll be watching that for what might be revealed about Defoe’s old academy, run by Charles Morton.

The subsequent discussion really brought home the possibilities of the DMI when it is launched, since from these papers it looks as if it will enable both close readings and the identification of larger literary-cultural patterns. Moreover, the DMI has striking potential to shift our notions of what was popular, of how authors disseminated their work, and even of how we conceive of reading practices in the period. With that in mind, I was looking forward to the opportunity to play with the early test version of the DMI search interface. All I can say is that – even in this early and not fully integrated version – it was a lot of fun, and I’m looking forward to the final version when it is launched.

The DMI blog can be followed here.

This review can now also be found on the BSECS online reviews page.

New summer digital institute: Folger’s ‘Early Modern Digital Agendas’

This sounds very interesting indeed: the Folger library will be running a new summer institute in July 2013 on Early Modern digital humanities. I quote the announcement (on the Early Modern Digital Agendas website):

In July 2013, the Folger Institute will offer “Early Modern Digital Agendas” under the direction of Jonathan Hope, Professor of Literary Linguistics at the University of Strathclyde. It is an NEH-funded, three-week institute that will explore the robust set of digital tools with period-specific challenges and limitations that early modern literary scholars now have at hand. “Early Modern Digital Agendas” will create a forum in which twenty faculty participants can historicize, theorize, and critically evaluate current and future digital approaches to early modern literary studies—from Early English Books Online-Text Creation Partnership (EEBO-TCP) to advanced corpus linguistics, semantic searching, and visualization theory—with discussion growing out of, and feeding back into, their own projects (current and envisaged). With the guidance of expert visiting faculty, attention will be paid to the ways new technologies are shaping the very nature of early modern research and the means by which scholars interpret texts, teach their students, and present their findings to other scholars.

This institute is supported by an Institutes for Advanced Topics in the Digital Humanities grant from the National Endowment for the Humanities’ Office of Digital Humanities.

With thanks to EMOB.

Crowdsourcing the Humanities: Chris Lintott speaks at the Digital Humanities Summer School, Oxford 2012.

While attending the Digital Humanities Summer School at Oxford University this summer, I had the chance to see a variety of lectures. The first of these was by Chris Lintott (Department of Physics, University of Oxford). Chris Lintott has been involved in the development of what has been termed Citizen Science – the communal engagement with science research – and runs one of the most notable of these projects, Zooniverse. My apologies if this is somewhat after the event, but here is the gist of Chris’s talk.

Chris started with the example of the data produced by astrophysical research: CERN, for example, produces hundreds of terabytes of data per second during its experiments (a terabyte is roughly 1,000 GB). This is ‘Big Data’ indeed, and it pushes at both the limits of computing and the limits of funding for such research. As an answer to both the processing and the funding problems posed by digging into such large amounts of data, crowdsourcing produces a very rich dataset. By involving multiple readers of data, crowdsourcing enables a high level of cross-checking and has been generating original knowledge and insight.

Chris then enumerated a number of examples of science-related projects that use communal collaboration to dig data; the first of these was Galaxy Zoo, which analyses data from the Hubble space telescope. Galaxy Zoo makes it easy for non-academics to take part: as you can see on the page that asks for your help classifying types of galaxies, it is as easy as clicking a button. This is a very important feature of getting communal participation: make it too difficult at the first step and you’ve lost your potential researcher. Chris argued that the key to people’s participation in crowdsourcing research like this was motivation: in a motivation survey that asked what kind of involvement people preferred, the largest proportion voted to ‘contribute’. It reflected, he suggested, a powerful desire for people to own their research. Indeed, that first step led on to people producing their own specialised communities (and their own online forums) within the larger Galaxy Zoo community. In most areas of new research there are typically a number of known unknowns, so it was also key to produce task-specific fields of enquiry, managing the kind of questions you want crowdsourced.

The extension of Galaxy Zoo to encompass a number of new areas of large-scale projects resulted in the umbrella project Zooniverse. Chris warned not to ignore the problems of scale, and specifically not to underestimate the potential numbers of contributors: across its various projects Zooniverse currently has 666,074 people taking part (Galaxy Zoo on its own has had around 250,000 people involved so far). While the project is dominated by astrophysics (five projects based on data supplied by space telescopes and satellites), it also includes humanities-orientated projects: transcribing papyrus documents in Ancient Lives, interpreting whale song in Whale FM, and analysing historical climate data in Old Weather. Old Weather uses the meticulously recorded weather data contained in Royal Navy ships’ logs dating back to the eighteenth century. What’s particularly interesting in this project is that the ships’ logs also include a huge variety of the day-to-day details of shipboard life – anything, in fact, that the duty officer of the day chose to write down. This data is also included in the project’s database and is fully searchable, so the community is engaging with research well beyond the confines of climatology.

Chris then moved on to discuss a variety of other humanities-focused crowdsourced projects, including the Bodleian Library’s project on musical scores, What’s the Score. Commenting again on the issue of building motivation, Chris noted that the most successful crowdsourcing projects do not face users with tutorials but use mini help boxes supplying context as they go along: ‘dump them into the deep end’, he suggested! Indeed, the New York Public Library’s project to transcribe the thousands of restaurant dishes in its huge collection of historical menus is a good example. Participation in the What’s on the Menu project starts with just the click of one button (they’re up to over a million dishes). Crucial, then, is to ensure that results are immediately obvious and tangible and that engagement with the wider community is easy. The Ancient Lives project (under the Zooniverse umbrella) involves transcribing ancient papyrus and uses a basic on-screen interface like a transcribing keyboard. It also includes a feature called ‘Talk’ – one click from the interface to engage in immediate responses to the particular image one is working on.

This led Chris to argue that perhaps ‘crowdsourcing’ may not be the right way of conceptualising the kind of work done by such communal research; he suggested that gaming theory might be more applicable to certain projects: an alternative way to imagine the motivation and rewards of crowdsourcing. Examples here include Foldit, a game for researching protein molecular structures, which is, it has to be said, complex and expensive. Similar, but much more user-friendly and addictive-looking, is DigitalKoot. At first glance this involves two games, ‘Mole Bridge’ and ‘Mole Hunt’, but they are in fact programs designed by the National Library of Finland to transcribe nineteenth-century Finnish-language newspapers: as you play, you transcribe. Turning analysis into gaming is obviously attractive and involves a shift in motivation. Similarly, the communal engagement with the SETI project (the Search for Extra-Terrestrial Intelligence) offers various badges depending on what you have found, from an interesting signal to an actual alien. However, this exemplifies the potential problems in gaming and motivation: unsurprisingly, no one has yet got the top badge in SETI. In short, Chris argued, don’t replace authentic experience and meaningful participation with goals. Instead, if we want to design projects around crowdsourcing, he reminded us that the people who want to get involved in such communal research are specialists in something: build on that.

Digital Humanities and Archives @ ASECS 2012

I think it’s fair to say that this year’s annual meeting attracted more panels on digital humanities than ever before (and that doesn’t even include the pre-meeting THATCamp workshops: for a good review of that see Lisa Maruca’s post on Early Modern Online Bibliography). I’ve posted already on the use of digital technology in teaching 18thC culture, but there were still quite a large number of panels that included discussions of digital humanities – whether explicitly labelled ‘digital humanities’ or not. What interested me were the issues that kept cropping up about how digital archives design data to be searched and how they are actually searched.

I was especially intrigued, in the roundtable ‘Digital Humanities and the Archives’, by Randall Cream’s (West Chester) call for digital archives to try to mimic the joyful moment of “serendipitous discovery” in traditional archives: such “interpretive moments”, produced through unexpected answers to “unthought” problems, may be difficult to reproduce in digital archives which depend so much upon naming, cataloguing, and tagging. Michael Gavin addressed how one manages the digitization of plays, given the special nature of a play as both text and theatrical performance. For Gavin, this is not addressed in the current tagging models of TEI, and he outlined how he had modified the tagging to produce an archive whose searches can be sensitive to these two play-contexts. Clearly, all were agreed that the move towards semantic tagging would enable a more human and sustainable interaction with digital data (semantic tagging, using XML for example, has the ability to describe concepts and meanings, as opposed to HTML, which describes the nature of the document and its relation to other documents. If anybody wants to, I’m perfectly willing to be corrected on this very rough definition).

In the ‘Poetry and the Archive’ roundtable, questions of use and searchability were again implicit. Jennifer Batt’s (Oxford) description of how the Digital Miscellanies Index could be searched was a good example of a digital resource that, perhaps paradoxically, is a more open-ended research tool: since this is an index of first and last lines and not a digital archive of texts, researchers are perhaps left to their own intuition. It is, of course, arguable: both Andreas Mueller (Worcester, UK) and Kyle Roberts (Loyola, Chicago), in the panel ‘Digital Approaches to Library History’, outlined digital archives that were, in effect, archives with a thesis, and so imagined ways of searching that would be directed towards research problems specific to their archives (in this case, library collections that are still extant or are now dispersed). Roberts, on the Dissenting Academies Online project, aimed to create a “virtual library” system able to comprehend multiform library catalogues and records, including author catalogues, short list catalogues, and borrowing registers covering 12,000 titles, 45,000 borrowings and over 600 borrowers. What was described was a process of tagging that enables the user to track borrowing through individual “borrower profiles” and the borrowing of individual books; to profile the development and use of a particular library collection over time; and to reveal shelving habits and systems. Mueller’s collaboration with the Hurd Library (the still-extant library of Bishop Richard Hurd (1720-1808)) also aimed at a “virtual” library, but through digital visualization. Using shelving catalogues and the few surviving original shelf marks, together with digital images of the shelves and a digital schematic loaded with data, may enable users to research how this man of letters interacted not only with the books in his collection but also with the space of his library. The data mapped into the visualization would be garnered from Hurd’s annotations, letters and entries in his commonplace books. While I have to declare an interest in the Hurd Library collaboration, it seems to me that these two projects have an important contribution to make in rethinking library history.

But design is only one half of the process, and while designing digital archives involves thinking carefully about the questions a user asks of the archive, two panellists on the ‘Digital Humanities and the Archives’ roundtable raised interesting questions about the ways and results of searching a digital archive from the user’s perspective (in both cases here, this was ECCO). Bill Blake (NYU) asked “what makes a good keyword search?”, and produced a list of popular search terms (“slavery” coming top). He suggested that many users had an impulse to “retrieve” rather than “search” and that the poorest keyword search terms effectively reproduced what was in the archive (“slavery”, one of the most popular search terms, was a good example of this). He argued that the best searches operated on a conceptual level. Indeed, that is what I’ve been training my own students to do: many of their first tries at ECCO use a broad topic-based search term; they discover that the results of such search terms are useless and relatively quickly begin to think about the processes involved in deciding on a better one (a factor I thought Bill Blake’s paper rather underplayed). Sayre Greenfield (Pittsburgh) posed a rather different problem with search results: what about “interpreting lack of results”? He argued that one can only “confirm the validity of negative results” by comparison to positive results elsewhere. A phrase search for “Ay, there’s the rub”, for example, returned only two (!) hits in ECCO; searching the Burney Collection returned a much larger number of hits, evidence that in the eighteenth century this particular phrase of Shakespeare’s inhabited the “cultural micro-climate” of journalism rather than literary discourse (ECCO doesn’t include journals and newspapers).

Managed serendipity anyone?

The present and future of digitisation projects: an interview with George Williams and Seth Denbo

I was very lucky to have the chance to talk to two of the leading voices in digital humanities when they very kindly agreed to take part in a filmed discussion at the ASECS annual meeting in San Antonio, March 2012. George Williams is an associate professor of English (specialising in the 18thC) at the University of South Carolina and will be familiar to many from the ProfHacker pieces in The Chronicle of Higher Education; Seth Denbo is a historian of eighteenth-century England, is involved with Project Bamboo and the IHR Seminar in Digital History, and is on the faculty of the Maryland Institute for Technology in the Humanities (MITH). (Using iMovie to film the discussion in my hotel room was a bit of an experiment – which is by way of an apology for any impairment in sound and/or visual quality. The interview is split into two parts.)

Improving ECCO part 2

Part of the excitement is the further option to create – and be credited as editor of – an entire text from your corrected OCR text. Gale’s release of the texts through 18thConnect to be corrected by TypeWright aims to have those corrected texts re-imported into Gale’s database. But it seems Gale is also offering the chance for those corrected texts to be published either as digital editions (possibly via 18thConnect, or at least peer-reviewed by them) or via Gale as a print text.

Now this is the odd point – what does Gale get out of releasing its texts into the wilds of the open-access world? ECCO isn’t cheap, and a number of universities have spent a considerable amount of money on it; even JISC’s one-stop interface for both EEBO and ECCO isn’t much cheaper. Gale’s income would presumably suffer. One might be tempted to think that both of those moves to wider access suggest Gale’s anxiety over the continuing authority of ECCO (with its old OCR software, its reliance on microfilmed texts and small images) and over the sustainability of this kind of database publishing model. One need only look at databases such as London Lives, William Godwin’s Diary or the Digital Miscellanies Index to see where digital resources are going. It looks as if Gale is trying to maintain ECCO’s relevance by opening it up to wider access, paradoxically undermining potential income. Perhaps they figure that the market for ECCO is saturated and that there is nothing more to lose: they would reap the kudos of keeping up with the general thrust of more recent digital resources towards open access (there’s probably a buzzier-sounding phrase for that, I’m sure). As for those texts that would be released for publication outside of ECCO, they might figure that this would amount to only selected areas or authors, and that the vast majority of texts in ECCO (non-canonical and found only through specialist searching) would be unaffected and so would continue to be ECCO’s USP.

Interesting times.