On making digital editions of public domain works for teaching and research
May 28, 2012
There are lots of freely available public domain texts on the internet, but not all of them are immediately suitable for use in teaching and research in their current form. The following post looks at challenges and opportunities in this area based on a brief survey of existing online resources.
As an example, I will focus on the works of Johann Georg Hamann, an 18th-century German philosopher and one of the figures I am looking at in my academic research.
The TEXTUS project aims to address some of these problems and to make it easier for scholars and students to contribute to (i) comprehensive bibliographies and (ii) scholarly editions.
Which works are available on the web?
There are a few places we might start to look. First of all, there is a reasonably good starting bibliography in the article on Hamann in the Stanford Encyclopedia of Philosophy.
However, this does not link to digital editions and many of the editions cited are still under copyright and hence not freely available on the web.
We can also browse works attributed to ‘Johann Georg Hamann’ on Open Library.
This list is incomplete and includes many books that are still in copyright. Usefully, it also links to digital copies of some of the works hosted at the Internet Archive.
Doing a text-filtered search for Hamann on Europeana gives a very nice looking overview of which works are available from libraries across Europe. There are lots of leads off to different scans of Hamann’s works at different libraries.
However, what we often want is not just to know which multi-volume scholarly editions exist, but which works are in these editions. Hence we either need an index for the edition, or – better still – a list of works linking through to (the relevant part of) digital editions.
The German-language Wikisource page for Hamann starts to do this, including a chronological list of many of his important works from 1758 to 1784. This is still a work in progress and, as far as I can tell, none of his works have yet been transcribed on Wikisource itself. Links are included to scans of his works on the Internet Archive.
One of the most promising resources is the complete scan of a nine-volume collection of Hamann’s works, edited by Friedrich Roth and published between 1821 and 1843. All nine volumes of the edition are linked to from the German-language Wikisource page on Hamann. We can browse the volumes using the Internet Archive’s BookReader interface.
If you are used to the blackletter typeface, this is a perfectly good reading copy of Hamann’s works. However, if you want to search, copy and paste or comment on the text, you will soon run into difficulties. Here is the title page of one of Hamann’s earliest works, Sokratische Denkwürdigkeiten:
Here is the Internet Archive’s scrambled plain text version:
There is currently no mechanism to correct or update the garbled plain text version of works on the Internet Archive. Archive staff recently told me that there were no plans to do this – though they said that in principle they would be interested in ingesting corrected plain text versions.
Wikisource currently has a simple system for proofreading and correcting plain text versions of scanned texts, which could be built on. Here is an example page from The Wind in the Willows:
There are plain text versions of several of Hamann’s works scattered on the web, but these do not always say where they are originally from or who created/reviewed the transcriptions. This means that it is not clear how to cite them or to verify whether they are accurate, so their value in teaching and research may be somewhat limited.
What does the future look like?
Here are a few thoughts on things that could be done by TEXTUS and other projects to improve the provision of digital editions of public domain works for teaching and research:
- Comprehensive, Machine-Readable Bibliographies. There should be a mechanism to enable scholars to help create and curate canonical scholarly bibliographies of primary sources. At the moment Wikisource is, de facto, one of the places where this is happening. But the bibliographies should be machine-readable, and easy for users to sort, search, correct and add to. In particular, they should cover not just printed editions but also the works (and even parts of works: essays, letters, poems, etc.) within these editions. Ideally the metadata could be easily imported into citation management systems like Zotero, cross-referenced with library catalogues and so on.
- Plain Text Versions of Works, Linked to Scans. There should be a mechanism to enable scholars to create and correct plain text versions of works. For example, it should be easier for users to import a work from the Internet Archive to Wikisource so they can help to correct texts generated by OCR software. Plain text versions should be linked to scans of public domain editions wherever possible, so their content can be verified against the original.
- Better Connected Resources. Major public domain content initiatives like the Internet Archive, Europeana, Wikisource and Wikimedia Commons should be better interlinked and should work together more closely to understand and address the needs of users. It would be wonderful if it were really easy for users to contribute to these projects by having their hand held through the process of uploading, describing, and transcribing texts. It would also be useful if users could easily find relevant content available on other projects through some kind of federated search interface.
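To make the first two points a little more concrete, here is a minimal sketch in Python of what a machine-readable bibliography record might look like. It is purely illustrative: the class names, field names and the example.org URLs mentioned below are my own assumptions, not a schema used by TEXTUS, Wikisource or the Internet Archive. The idea is that an edition lists the individual works it contains, and each work can carry transcriptions that record where the text lives, which scan it can be checked against, and who created and reviewed it.

```python
import json
from dataclasses import dataclass, field, asdict
from typing import List, Optional


@dataclass
class Transcription:
    """A plain text version of a work, with the provenance needed to cite and verify it."""
    text_url: str                      # where the plain text lives (hypothetical field)
    scan_url: str                      # page images the text can be checked against
    transcribed_by: str
    reviewed_by: Optional[str] = None  # None means nobody has checked it yet


@dataclass
class Work:
    """A single work (essay, letter, poem, ...) inside a printed edition."""
    title: str
    volume: Optional[int] = None       # volume of the edition, left empty if unknown
    transcriptions: List[Transcription] = field(default_factory=list)


@dataclass
class Edition:
    """A printed edition together with the works it contains."""
    label: str                         # descriptive label, not the printed title
    editor: str
    published: str
    volumes: int
    works: List[Work] = field(default_factory=list)

    def works_missing_verified_text(self) -> List[Work]:
        """Return the works that have no reviewed transcription."""
        return [
            work for work in self.works
            if not any(t.reviewed_by for t in work.transcriptions)
        ]


# Example record, using only details mentioned in this post.
edition = Edition(
    label="Hamann's works, Roth edition",
    editor="Friedrich Roth",
    published="1821-1843",
    volumes=9,
    works=[Work(title="Sokratische Denkwürdigkeiten")],
)

# Serialise to JSON so that other tools could sort, search or import the record.
print(json.dumps(asdict(edition), ensure_ascii=False, indent=2))
```

A record like this could answer simple questions such as "which works in this edition still lack a checked plain text version?" by calling `edition.works_missing_verified_text()`. A real project would more likely build on an existing bibliographic standard than invent its own format.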