The Citation Conundrum

There is an unknown – but probably shockingly large – number of public domain texts on the web. Many of these could be of value to students and scholars. Lots of digital texts have page numbers which can be straightforwardly referenced in papers and publications: for example journal articles, scanned monographs, born-digital word processing documents, and so on. But how should we cite large public domain texts without pages or page numbers? Let’s call this the ‘citation conundrum’.

First of all, we might wonder about the long term prospects of the page. Usually physical books divide texts into pages more or less arbitrarily. Many document formats divide texts into pages, presumably partly so that they can be easily printed. Many digital devices enable dynamic formatting where the page divisions change with the size of the font. Should we accept that page numbers are a thing of the past, a convenient metaphor, but one which will not be with us for much longer?

Perhaps in the future we’ll cite line numbers? Perhaps we’ll just search for the passage we’re after? Perhaps the whole textual estate of humankind will be retrofitted with hyperlinks? Perhaps we’ll have algorithms to help us identify the referents of references which no longer refer, obscure relics from a barely recognisable age when people had to butcher trees to capture their thoughts.

Perhaps to all these perhapses. But what until then? Until then people who work with public domain digital texts need to be able to find and refer to passages within them, in adherence with established stylistic principles, practices and standards. I can think of two options.

Firstly we can eschew page numbers in favour of other referential mechanisms. Technically, providing a URL with a date of access is sufficient. The MLA also provides guidance on citing ‘digital files’, which include PDFs, word processor documents, scanned images and so on (Rule 5.7.18). Presumably anyone who wants to put a passage into context can do a plain text search. Or we can use anchors or line numbers to point to precise parts. In this scenario the page number is replaced with a (hopefully) persistent URL.
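To make the first option a little more concrete, here is a minimal Python sketch of a citation built from a URL, an anchor and a date of access. The function name `cite_url`, the example address and the anchor `p42` are all made up for illustration; nothing here follows a particular style guide.

```python
from datetime import date
from urllib.parse import urlsplit, urlunsplit


def cite_url(base_url: str, anchor: str, accessed: date) -> str:
    """Return a citation string pointing at one anchor in an online text.

    Hypothetical helper: the output format is an assumption, not MLA.
    """
    parts = urlsplit(base_url)
    # Replace any existing fragment with the anchor of the cited passage.
    target = urlunsplit(
        (parts.scheme, parts.netloc, parts.path, parts.query, anchor)
    )
    return f"{target} (accessed {accessed.isoformat()})"


print(cite_url("https://example.org/texts/hamlet", "p42", date(2012, 2, 14)))
```

This only works as a citation if the URL stays put and the edition actually exposes an anchor for the passage, which is exactly the "(hopefully) persistent" part.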

Secondly we could introduce new (arbitrary) page numbers, or use the page numbers of some (arbitrary) public domain edition of the work we want to cite. Many of the works available on Wikisource have had their page numbers stripped out, and Project Gutenberg has an explicit policy to remove them. So either we can rather laboriously re-insert page numbers from some printed edition, or generate an arbitrarily paginated digital edition (as a digital file, via URL) which can be cited.
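As a rough illustration of the second option, the sketch below generates arbitrary pages by cutting a text every fixed number of words. The names `paginate` and `page_of` and the figure of 300 words per page are assumptions chosen for the example, not a proposal for a standard.

```python
from typing import List, Optional


def paginate(text: str, words_per_page: int = 300) -> List[str]:
    """Split a text into arbitrary 'pages' of at most words_per_page words."""
    words = text.split()  # note: this normalises all whitespace to single spaces
    return [
        " ".join(words[i:i + words_per_page])
        for i in range(0, len(words), words_per_page)
    ]


def page_of(pages: List[str], quote: str) -> Optional[int]:
    """Return the 1-based page number containing the quote, or None.

    A quote that straddles a page break is not found by this simple check.
    """
    for number, page in enumerate(pages, start=1):
        if quote in page:
            return number
    return None


pages = paginate("one two three four five", words_per_page=2)
print(pages)
print(page_of(pages, "four"))
```

Page numbers produced this way are only meaningful if the paginated file itself is published and cited, since a different word count per page gives different numbers.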

I’m very keen to learn more about what others have said, thought or done about this – partly so we can bear this in mind when building TEXTUS. Do you know of an interesting approach, paper, standard, or plugin? If so please do leave a comment below.


  1. Posted February 14, 2012 at 6:43 pm

    Classical texts already rely on a system unrelated to pagination (see here), so maybe digital texts could imitate, to some degree, this ancient model. Very different historical circumstances seem to bring up the same problem.

    “Scholarly” (that is, cite-worthy) editions of digital texts could include hierarchical numbered sections (at the very least, chapters and paragraphs?). With a digital text, it should be very easy to show and hide these section numbers, so they wouldn’t have to impede the reading experience. If somebody wants a citation, then, they just click “show section numbers” and they’ll have a specific location to use.

    However, if this were to work across multiple manifestations of the same text (common for the public domain, especially), some “authority” would have to designate those sections, and other digital editions of the same texts would have to standardize according to the established system for that particular text.

    There’s one idea, at least.

  2. Posted February 14, 2012 at 7:32 pm


    An interesting analysis of a very complex problem. Some brief remarks. Don’t forget that a reference to a page of a book is not an exact, direct reference to the cited passage. (Hyperlinks can refer exactly to the position of a passage in a text.) Anyone who has had to find a citation to page 42 of an old newspaper will understand. References in an index are not very precise either, and when a page contains two or three texts by two or three authors, index references carry little semantic information.

    I don’t open my Bible very often, but the last time I did, I realised that the Bible’s publishers created the “perfect” system for citation. “Jn, 2, 4” refers to a semantic fragment of the text, not to a physical location in the document. Every reader of any edition of the Bible, in any language, will find, or should find, the same sentences and the same meaning. I know that different editions of the Bible exist, and that there are differences between them. But the general system is good. Of course, it needs a canonical text and a system of references. It works because the reference for a citation is independent of the physical, material device (book, edition, etc.).

    A citation is part of a TEXT, not part of a book: it’s important to understand that in order to imagine a new way to link citation and text. The same goes for a “fragment” of an image. The Mona Lisa’s smile is part of Leonardo da Vinci’s work: anybody can find it in any of the millions of reproductions of the Mona Lisa. Her smile is not a part of the real, material painting (which is a unique masterpiece). How can we make a universal reference to Hamlet’s “To be or not to be”? Pages, editions and reading devices are useless for that. The only way is to refer to Shakespeare’s work itself, independently of its materialisation on a sheet of paper, an iPad screen, etc. Not an easy task! And that is why billions of hyperlinks on the Web work on any screen, even if you change the size, fonts, etc. (unless the website or the page moves!).

    I don’t have the solution, but I’m sure that it will be semantic, and that it will have to be more precise and more direct than the traditional page system. We don’t need the same system for digital text; we need a better one.

  3. Posted February 14, 2012 at 8:41 pm

    Quotations from within digitized public domain texts are easy to find through search, either web search or local device search. As long as they are unique within a book, the book identifier (isbn, url, name and edition information, or whatever) and the quote are sufficient to find it in context. This turns a problem of organization into a problem of search, which is what has happened with more general managing of digitized texts.

    Citing in a more structured way can use those divisions of texts that are not dependent on pagination: chapter and section titles, section and subsection numbers, and paragraph, sentence & word numbers. This can be made manageable for human use through structured text that can be searched or processed algorithmically to find the correct chapter/section/paragraph/sentence, or by rendering texts on-screen or in print like poems with paragraph and/or sentence numbers rather than stanza numbers in the margins.
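One way the algorithmic lookup described above might work is sketched here in Python. It assumes paragraphs are separated by blank lines and uses a deliberately naive sentence split; the function name `locate` and the sample text are invented for the example, and a chapter level could be added in the same way.

```python
import re
from typing import Optional, Tuple


def locate(text: str, quote: str) -> Optional[Tuple[int, int]]:
    """Return (paragraph number, sentence number) of the first sentence
    containing the quote, both 1-based, or None if it is not found."""
    # Assumption: paragraphs are separated by one or more blank lines.
    paragraphs = [p for p in re.split(r"\n\s*\n", text) if p.strip()]
    for p_no, paragraph in enumerate(paragraphs, start=1):
        # Naive sentence split: break after '.', '!' or '?' plus whitespace.
        # Abbreviations such as "e.g." will be split incorrectly.
        sentences = re.split(r"(?<=[.!?])\s+", paragraph.strip())
        for s_no, sentence in enumerate(sentences, start=1):
            if quote in sentence:
                return p_no, s_no
    return None


sample = (
    "This is not a test. It is a sketch.\n\n"
    "Looking out of the window, I saw rain. Then it stopped."
)
print(locate(sample, "Then it stopped"))  # (2, 2)
```

The resulting pair could then be written as something like 2.2, in the same spirit as verse references, provided everyone numbers paragraphs and sentences the same way.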

    I personally use page and paragraph number followed by the first few words of a sentence to identify sections of a text in my notes for later reference, e.g. 72.4 – This is not… 102.2 – Looking out of the window… 212.3 – Sample based music can…

    In the absence of pages, I’d use paragraph numbers within the chapter.

    Any of these strategies can be represented as urls.

  4. Posted February 14, 2012 at 8:42 pm

    Formatting died in paragraph 3 above, each number should be on a new line. 🙂