s-HACK-speare!, or, Building Bill-Crit-O-Matic

So, I built Bill-Crit-O-Matic, a site that pulls in the TEI data for Shakespeare's Comedy of Errors to help readers learn about the scholars and scholarship in the Modern Language Association's edition of the text.

Here's a description of the project, with an aim toward some of the technical aspects along with a nod here and there to their broader implications for crit-code and higher education. A partially-overlapping description with an aim toward use cases can be found on Bill-Crit-O-Matic's about pages.

Background -- MLA New Variorum Challenge

The prompt for this project is the Modern Language Association's New Variorum Shakespeare Digital Challenge. From their site:

The MLA Committee on the New Variorum Edition of Shakespeare is sponsoring a digital challenge and is seeking the most innovative and compelling uses of the data contained within its recently published volume, The Comedy of Errors.

The MLA has released the XML files and schema for The Comedy of Errors under a Creative Commons BY-NC 3.0 license. Scholars may freely download these files from GitHub and use this material in their research.

We are seeking innovative new means of displaying, representing, and exploring this data and are thus holding a competition to find the most exciting API, interface, visualization, data-mining project, or other use of The Comedy of Errors XML.

The Idea

The core idea of this project comes from a reflection on a transitional time in my scholarly career, when I was a graduate student moving into being a scholarly commentator on texts. There was a moment during graduate school (I suspect others encounter it, too, at some point), in which reading standard scholarly editions of a text was no longer about reading the text. Instead, it was about reading the scholarship surrounding the text. As a reading practice, that reversal of privilege -- the commentary becomes primary while the "main" text becomes secondary -- is probably familiar to people who are familiar with medieval manuscripts and/or the Talmud and/or various other textualities that visually mix text and commentary in more sophisticated ways than our modern impoverished footnotes can sustain.

Thus, the guiding principle of this project is to turn the scholarship around Shakespeare's Comedy of Errors on its head. Instead of reading the playtext as the main (privileged) content, the site privileges the scholars and their scholarship on the text. The overarching mission is to find new ways to get familiar with scholarship of the Comedy of Errors. To find connections between their commentary. To enter the text not from the playtext, but via the scholarship and the scholars. It's an inversion of the textuality of the printed scholarly edition.

Thus, this site is not and cannot be a replacement for the MLA's edition. You cannot read through the playtext in any useful way, and you cannot read through the appendix or commentary either.

Think of it as an annotated bibliography meets match.com. Get to know the scholars in the MLA Variorum edition, where their profile is their own commentary on the text.

Multimedia

Despite the dearth of images, video, visualizations, and other assorted pretty pictures, this is a multimedia text. More precisely, it is a dual-media text. One text is the site itself. The other text is the printed edition of the MLA Variorum Comedy of Errors.

Read them together. Read through the playtext in print, and -- as I know you will do -- notice when something in the commentary or appendix catches your interest. Then, come to this site to find new ways to expand that connection. Are you interested in a particular section of the appendix? See more about the scholars in the appendix here. Are you interested in a particular character? Start here. Have you developed your interests enough to want to bring things together by searching on a word or phrase? Here.

Or -- and this is more in line with the philosophy of inverting the scholarly text -- start with a concept, and discover the scholars and scholarship you need to know about. For example, imagine that you (or your students) are getting to grips with Lacanian approaches to Shakespeare. The appendix section "Identity" might be a good place to start looking for the scholars and scholarship that might be of interest. That approach would start with reading through the discussion there, keeping an eye out for the key terms that make you think "Lacan!", making a note of the citations, then turning to the bibliography to see the relevant scholarship. That's the printed edition approach to such a question.

The Bill-Crit-O-Matic approach moves in the other direction. Go to the search page and search for "Mirror" in the appendix. That will yield you the text of paragraphs with that word. Here's one that looks promising:

Clicking the "Note Bibliography" button shows the bibliography for that paragraph to the right. From there, I could bop to more info about the citations themselves by clicking "View". But I'm trying to get familiar with the scholars, so I'll follow the link to Janet Adelman.

When I open up the "In Conversation With" info, I can see another name that I hadn't picked up before, and continue following my nose to pick up more starting points in the commentary and in the playtext that might be relevant.

Of course, the next thing to do would be to go back to the print edition, which contains the argumentative context for the discussion, and the literary context for the passages. That's one way I imagine the two media -- site and print -- working together and complementing each other by offering reciprocal entry points to the text and scholarship.

Why Omeka?

Honestly, the way this was initially conceived it should have been a Drupal site. In general, if you have a lot of complex relationships between different pieces of content, Drupal is the strongest CMS I know for that.

But I don't know it well enough to have tackled the importing processes from MLA's TEI. And so, as always with (side) projects on a deadline, a certain technological choice was based on expediency and practicality.

Then again, lately I've been deep in developing for Omeka, but that inevitably leads to a dissociation from actually building with Omeka. I wanted to try something that would push on some boundaries of Omeka in a practical way. This project gave me some things I wanted to do, and I needed to figure out if and how Omeka could do them.

Finally, I'm a big fan of pushing here and there on what Omeka claims it is. We often talk about it as an easy way to publish information about "cultural heritage artifacts". That's a handy catch-all for the various things that galleries, libraries, archives, and museums work with. But what else could/should fit into this category, especially online where "artifact" becomes less tangible?

I played a while ago with the idea of the data produced by the United States government being a cultural heritage artifact and playing with it in Omeka. Here, instead of the text of Comedy of Errors being treated as the artifact, I'm treating the scholarship and scholars surrounding the play as the "artifacts" and bringing them to the fore.

"Items" in Omeka

One interesting part of bringing the MLA TEI data into Omeka was considering just how to do so, and where to place the data. In Omeka, the primary data (in informal CS-speak, the 'first-class citizen') is the Item. Omeka places few, if any, restrictions on what, exactly, an "Item" is. A painting? Sure! A sculpture? Why not! A person? What d'hey! A lesson plan? Dandy!

Omeka also includes Item Types, which let you differentiate between Items based on the needs of your own project, and the differentiation is mostly, though not necessarily, predicated on a differentiation of the metadata that is relevant to each item type.

This is where things got interesting. I had a premise of reversing the privileged position of the playtext in favor of the scholarship of Comedy of Errors. Seemed like that implied that what I wanted to emphasize -- the scholars and their scholarship -- should then conform to the principles of data privilege in the system. Scholars and their scholarship would be items -- first-class citizens of the system -- and other surrounding content should be something else.

I mostly stuck to that. The exception to that is characters in Comedy of Errors. I made them Omeka items, too. That's because I also thought of characters as another entry point into understanding Comedy of Errors scholarship. That is, I wanted people to be able to ask who talks about, and what is said about, a particular character.

That means that, in this Omeka site, the notion of an "Item" is ultimately an assertion of what I want the entry point for further exploration to be. The direct entry points are scholars, bibliography entries, and characters. (Indirect: search, and filter by headings in the appendix).

It carried with it some consequences, though. First and foremost, I couldn't alter the data model to add some data that I would have liked. For example, I didn't want to lose things like the XML ids in the TEI. I could have stuffed those in as Dublin Core Identifiers, but that seemed like a bit of a stretch, and would have made it difficult in cases where more than one identifier would be appropriate (e.g., the sigla for editions). So, for the items imported, there is also a partially-duplicate model more focussed on the TEI. The TEI-oriented data carries things like XML ids, line number data, the original TEI itself, and the HTML derived from it. In the site, what is displayed ends up being a mix of the two.

Working With The TEI

Importing the data into Omeka

Very technically, the data is not in TEI, but for most practical purposes it is, so I could start with the standard TEI XSLT for the basic job of creating HTML. Because of the different versions of the TEI and the stylesheets, though, it took some experimentation to find that version 5.59 of the XSLT came out closest. The major exception was in the commentary notes. After more experimentation, I switched to using the stylesheet in the TEI Display plugin for Omeka developed a while ago by Scholars Lab.

There were relatively minor modifications to those, mostly aimed at removing transformations to HTML that ended up being more clutter than help in the display, plus slight changes to the attributes. One notable modification was the need to change some of the ids generated into class names, since in the various combinations of content I couldn't rely, for example, on the same bibliography entry not appearing more than once on the same page (see below on linear text vs. web pages). Granted, part of that is a weakness of duplication in my own code and approach, but the duplication simplified some of the display and JavaScript work.
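
As a rough illustration of the id-to-class change, here is a minimal sketch rather than the code used on the site. It assumes the HTML fragment is already loaded in a DOMDocument. The function name `idsToClasses` and the sample ids are made up for the example.

```php
<?php
// Hypothetical helper: move every id attribute into the class attribute,
// so the same fragment can safely appear more than once on a page.
function idsToClasses(DOMDocument $doc): void
{
    $xpath = new DOMXPath($doc);
    foreach ($xpath->query('//*[@id]') as $el) {
        $id = $el->getAttribute('id');
        $existing = trim($el->getAttribute('class'));
        // Keep any class that was already there and add the old id after it
        $el->setAttribute('class', $existing === '' ? $id : $existing . ' ' . $id);
        $el->removeAttribute('id');
    }
}

// Made-up sample markup, standing in for XSLT output
$doc = new DOMDocument();
$doc->loadXML('<div><p id="bib_steevens" class="bibl">Steevens</p><p id="bib_adelman">Adelman</p></div>');
idsToClasses($doc);
echo $doc->saveXML($doc->documentElement), "\n";
```

After a change like this, display code can select on the class name, which is allowed to repeat on a page where an id is not.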

Just getting the HTML representation, though, would have missed the point of aiming toward a web site that works by connections (see above on "Why Omeka"). That meant doing a fair amount of processing on the data as it was being imported to build those connections from the data in the TEI. In turn, that meant some careful ordering of the kinds of content. For example, building a relation between a commentary note and the scholars mentioned in it required importing data about the scholars first. Here's a summary of the import sequence.

  1. Roles: Just pull in the names of the characters
  2. Speeches: Pull in data from the speeches in the play (<speech>). This makes the connections back to the roles, and also does some post-processing on the HTML generated by XSLT to convert the <a>s into <span>s for the lines (see below).
  3. Stage Directions: Simpler version of above, just sorting out the line breaks into spans
  4. Bibliography and Editions: This pulls in the bibliography entries and the individual scholars and builds relations there.
  5. Commentary Notes: This is where things get complicated. Each <note> becomes a record in Omeka. Then, the related bibliography entries get parsed out, and relations to those are built, as well as relations to the speeches and stage directions, both the immediate referent of the note and any other speeches that are mentioned within the note. That meant looking up the speech that contains the line(s) referred to. For good measure, I also chased through the bibliography material to build relations directly between the notes and the scholars.
  6. Appendix Paragraphs and Notes: A similarly complex process of importing in <p/> and <note/> in the appendix TEI and chasing through the speeches and bibliography entries referred to, plus the direct connection between scholars and the paragraphs. Here, I also looked up the headings in the appendix and turned those into tags.
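
The ordering constraint in the list above can be sketched as a small dependency check. This is an illustration only: the step names follow the list, but the dependency lists are my reading of it, and `checkImportOrder` is a made-up function, not part of the importer.

```php
<?php
// Each import step lists the steps whose records it needs to look up.
// The dependencies are assumptions drawn from the list above.
$steps = [
    'roles'            => [],
    'speeches'         => ['roles'],
    'stage_directions' => [],
    'bibliography'     => [],
    'commentary_notes' => ['speeches', 'stage_directions', 'bibliography'],
    'appendix'         => ['speeches', 'bibliography'],
];

// Hypothetical check: walk the steps in order and fail if a step
// appears before something it depends on.
function checkImportOrder(array $steps): array
{
    $done = [];
    foreach ($steps as $name => $needs) {
        foreach ($needs as $need) {
            if (!in_array($need, $done, true)) {
                throw new RuntimeException("$name needs $need to be imported first");
            }
        }
        $done[] = $name;
    }
    return $done;
}

echo implode(' -> ', checkImportOrder($steps)), "\n";
```

Moving a step ahead of one of its dependencies makes the check throw. In an import like this one, the same mistake would show up as relations that cannot be built.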

Any Linked Open Data / RDF fans among those still, amazingly, reading this will see a graph if you squint hard. My approach here is heavily influenced by RDF graphs, but there won't be SPARQL queries anytime soon. Pulling out some actual RDF might be possible, though.

Representing the document vs online publishing

Part of the interesting process of importing the data was discovering just how aligned the NVS is to the linear text. This made it tricky to put together a non-linear, fragmented and linked site. The most prominent example is line breaks (<lb/>). In the XML, these are empty elements, acting essentially as anchors. Indeed, the standard XSLT stylesheets treat them exactly that way, converting <tei:lb> to <html:a>. This works very well for the process of creating a print document: it just signals where to spit out a line number.

But online, I wanted a way to highlight a particular line number. For example, when looking at some commentary that contains references to particular lines, it makes sense to be able to highlight the relevant line. The XML in the commentary does nicely represent the link between the reference and the line itself with <ref> elements, e.g. <ref targType="lb" target="#tln_0104">104</ref>.
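
To show how those references could be read, here is a minimal sketch that collects the line ids a note points at. It assumes the elements are not in an XML namespace (for namespaced TEI the XPath would need a registered prefix). The function name `lineTargets` and the sample note text are made up, and only the <ref> markup follows the example above.

```php
<?php
// Hypothetical helper: collect the line ids that a commentary note refers to.
function lineTargets(string $noteXml): array
{
    $doc = new DOMDocument();
    $doc->loadXML($noteXml);
    $xpath = new DOMXPath($doc);
    $ids = [];
    foreach ($xpath->query('//ref[@targType="lb"]') as $ref) {
        // "#tln_0104" becomes "tln_0104", the id of the line itself
        $ids[] = ltrim($ref->getAttribute('target'), '#');
    }
    return $ids;
}

// Made-up note content for illustration
$note = '<note>Compare <ref targType="lb" target="#tln_0104">104</ref> and '
      . '<ref targType="lb" target="#tln_0110">110</ref>.</note>';
print_r(lineTargets($note));
```

Each id returned this way names one line, so it only becomes useful for highlighting once the HTML has an element with that id wrapped around the line's text.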

But to work with those in HTML, the empty <a>s derived from <lb>s needed to become <span>s across the relevant lines.

Thus, after doing the XSLT to get HTML for each speech in the play, there was some post-processing of the HTML to make the switch:

    // Change each empty anchor into a span wrapping the text node that follows it.
    // Copy the live node list into an array first, so that changing the DOM
    // inside the loop cannot disturb the iteration.
    $aNodes = iterator_to_array($doc->getElementsByTagName('a'));
    foreach ($aNodes as $a) {
        $span = $doc->createElement('span');
        $span->setAttribute('class', 'line');
        // Carry the line's id over so references can still target it
        $tlId = $a->getAttribute('xml:id');
        $span->setAttribute('id', $tlId);
        $textNode = $a->nextSibling;
        if ($textNode) {
            // appendChild() moves the node from its old position into the span
            $span->appendChild($textNode);
        }
        // Swap the span in where the anchor was, which also removes the anchor
        $a->parentNode->replaceChild($span, $a);
    }

Names

Some of the names of scholars presented an interesting challenge for consistency. I'm not really sure if differences between names that strongly seemed to identify the same person were due to fidelity to the original references, or the inevitable variations that crop up in creating a new document. Either way (unless I missed it in the XML), it would have been nice to work with an authority file, with references from the XML to those authorities.

An interesting set of instances like this was the appearance of <author> and similar elements that contained pairs with and without the <date>s for a person. For example:

<author>Steevens, George (<date>1736–1800</date>)</author>

and

<editor>George Steevens</editor>

in the coe_bibliography.xml file, plus

<name type="app">George Steevens</name>

in coe_front.xml

Most of those were fairly easily taken care of with the blunt instrument of normalizing by some string comparison:

case 'George Steevens':
case 'Steevens, George (1736–1800)':
    $textContent = 'Steevens, George';
    break;
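
The same blunt instrument can also be written as a lookup table, which keeps the variants in one place. This is a sketch under assumptions, not the code used for the import: `NAME_VARIANTS` and `normalizeName` are made-up names, and the whitespace clean-up is an extra step I have added.

```php
<?php
// Hypothetical lookup table: every variant spelling maps to one preferred form.
// The variants are the ones quoted above.
const NAME_VARIANTS = [
    'George Steevens'              => 'Steevens, George',
    'Steevens, George (1736–1800)' => 'Steevens, George',
];

function normalizeName(string $raw): string
{
    // Collapse runs of whitespace so line breaks inside an element do not matter
    $clean = trim(preg_replace('/\s+/u', ' ', $raw));
    // Names without an entry in the table pass through unchanged
    return NAME_VARIANTS[$clean] ?? $clean;
}

echo normalizeName("George  Steevens"), "\n";
echo normalizeName('Adelman, Janet'), "\n";
```

A table like this still relies on someone spotting each variant by hand, so it does not replace an authority file.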

However, the representations of Charles Clarke's and Mary Cowden Clarke's work proved to be a curiously encodingo-gendered twist:

  1. <name type="app">Charles &amp; Mary Cowden Clarke</name>
  2. <author>Clarke, Charles Cowden</author>
  3. <author display="book(ldash)">Clarke, Charles Cowden</author>, &amp; <author>Mary Cowden Clarke</author>
  4. <author>Clarke, Mary</author>
  5. <author>Cowden Clarke, Charles &amp; Mary</author>

Notice how we see both Mary and Charles as one <author>, and the two as two distinct <author>s.

Hopefully ORCID will play a role here in the future. VIAF lists both as surname "Clarke". (Tip o' the pen to DigiKeri_SIL for the VIAF and to wynkenhimself for info via Folger Shakespeare Library.) It looks like the printed edition uses "Clarke, Charles Cowden" and then "-----, & Mary Cowden Clarke" in the bibliography. Somehow, that doesn't feel right. Probably because my last name is "Murray-John". But among the European branch of my patrilinear family, it's "Murray John". "Cowden Clarke, Mary" and "Cowden Clarke, Charles" seem right to me. Either way, there are many layers of representation at work here.

This encoding is good fodder for a little crit-code work, I think. But more, my handling of this when I noticed it is also interesting from that perspective. I'm frankly a little sheepish and embarrassed by how I handled it. Or rather, by the fact that I haven't yet handled it.

See, from a coding perspective, this is particularly annoying. It was bad and resource-intensive enough to go through the text comparisons. But to programmatically excise the gender imbalance here would call for lots of lines of code to look for this one special case and do a lot of unique processing. It's doable (I think). But it would be a fair number of lines of code and work to handle 5 cases among over 700. With the site needing to be up before the deadline for the competition, I punted on addressing this as I should for now.
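
For what it's worth, one possible shape for the special case is a table that maps each encoded string to the list of people it stands for. This is only a sketch, not something the site does. The preferred forms are an assumption, as is treating "Clarke, Mary" as the same Mary Cowden Clarke, and `peopleFor` is a made-up name.

```php
<?php
// Hypothetical preferred forms for the two scholars
const CHARLES = 'Cowden Clarke, Charles';
const MARY    = 'Cowden Clarke, Mary';

// Each encoded string maps to one or more people.
// Keys are the text content of the elements, so &amp; appears as &.
const CLARKE_VARIANTS = [
    'Charles & Mary Cowden Clarke'  => [CHARLES, MARY],
    'Cowden Clarke, Charles & Mary' => [CHARLES, MARY],
    'Clarke, Charles Cowden'        => [CHARLES],
    'Mary Cowden Clarke'            => [MARY],
    'Clarke, Mary'                  => [MARY],
];

// Returns the people a name string stands for; unknown names pass through.
function peopleFor(string $textContent): array
{
    $name = trim($textContent);
    return CLARKE_VARIANTS[$name] ?? [$name];
}

print_r(peopleFor('Charles & Mary Cowden Clarke'));
```

This covers only the string lookup. It leaves out the importer changes needed to attach more than one scholar to a single element.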

Interesting case-study for crit-code and feminism, yes?

Anomalies, Issues, Pull Requests, and Tenure

Other anomalies and quirks I discovered in the TEI were teeny once discovered, but surprisingly annoying to discover and address since they were often isolated to one or two cases. I've listed them as issues on the NVS data in GitHub. I mention it here to again raise a question that I posted to Twitter, since it got a few retweets: What if submitting issues (or, even better, a pull request) to GitHub counted as a scholarly contribution, kind of like the short scholarship in Notes & Queries? Little things like these, I think, could have an outsized influence on scholarship that works with this base data -- once someone discovers a quirk in the data, no one else should have to re-discover it. This saves researchers a lot of time and improves the base data. Surely a short article that improves a base text used for research would count as a scholarly contribution, so why not issues or pull requests?

I think it's a solid argument that could be made to tenure and promotion committees. More importantly here, since the MLA is explicitly inviting researchers to work with the data, it seems like something for them to consider as they develop their guidelines for such committees.

Site Construction

I was strongly tempted to work off of the emerging Omeka 2.0 code. It affords a lot of nifty things that could have made this project easier. There are better search mechanisms (thanks to Jim Safley) and cleaner code throughout (thanks again to Jim and especially to lead developer John Flatness).

But it was a combination of weakness of my skills and the fast development track that Omeka is on that led me to keep it to the current Omeka release, Omeka 1.5.3.

That said, some of the plugins I am using are primarily intended for Omeka 2.0 -- and have been developed as part of another site-specific grant, the Omeka Commons. I've hacked them here to make them work with Omeka 1.5. The Zotero-like groups, parts of commenting, and most of user profiles are only slated for stable release with Omeka 2.0.

Upshot: much of the site is explicitly unstable.

There are also some short-cuts in play. Instead of building true view pages, I often resorted to stuffing the PHP I needed into a Simple Page.

The most interesting site-construction challenge using Omeka was the fact that it isn't really designed with the alphabet in mind. While it is easy enough to sort items alphabetically by any Dublin Core element by adding some parameters to the URL, there is not a built-in mechanism for finding items that, for example, have a title starting with "B". Not so good for a site that needs to show something akin to a bibliography.

So I needed to build in some rudimentary alphabetization work. It's in Omeka's recipe for browsing items alphabetically.
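
As a language-level illustration of the "starts with B" problem, here is a minimal sketch that works on a plain array of titles. It does not use any Omeka API, and on a real site the filtering would more likely happen in the database query. The function name `titlesStartingWith` is made up, and the sample titles are names mentioned elsewhere in this post.

```php
<?php
// Hypothetical helper: keep only the titles that start with a given letter,
// ignoring case, and return them sorted alphabetically.
function titlesStartingWith(array $titles, string $letter): array
{
    $matches = array_filter($titles, function (string $title) use ($letter): bool {
        // Compare just the leading bytes, case-insensitively (ASCII only)
        return strncasecmp(ltrim($title), $letter, strlen($letter)) === 0;
    });
    // sort() also reindexes the array from zero
    sort($matches, SORT_FLAG_CASE | SORT_STRING);
    return $matches;
}

$titles = ['Steevens, George', 'Cowden Clarke, Mary', 'Adelman, Janet', 'Cowden Clarke, Charles'];
print_r(titlesStartingWith($titles, 'C'));
```

The comparison is byte-based, so initials with accents or other non-ASCII characters would need a multibyte-aware version.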

Then, there were mostly just tweaks and hacks to the Emiglio theme. The biggest work there involved branching view pages to different templates based on the item type.

There was some final cleanup, too, once the data import process was working correctly. I edited the tags (appendix headings) to reflect their hierarchy in the document (UPDATE: I underestimated how long a job that would be manually -- I'll have to code up an automated way to do that), and names that were represented as "given-name family-name" in the TEI were just manually edited to "family-name, given-name".

"Any medium powerful enough to extend man's reach is powerful enough to topple his world. To get the medium's magic to work for one's aims rather than against them is to attain literacy."
-- Alan Kay, "Computer Software", Scientific American, September 1984

I'm patrick_mj on Twitter

© Patrick Murray-John. All content is CC-BY. Drupal theme by Kiwi Themes.