Saturday 27 September 2008

XML Schema for archiving email accounts


I attended several great sessions at the Society of American Archivists conference last month. There is a wiki for the conference, but very few of the presentations have been posted so far...

One session I particularly enjoyed addressed the archiving of email - 'Capturing the E-Tiger: New Tools for Email Preservation'. Archiving email is challenging for many reasons, which were very well put by the session speakers.

Both the EMCAP and CERP projects were introduced in the session.

EMCAP is a collaboration between state archives in North Carolina, Kentucky, and Pennsylvania to develop means to archive email. In the past, the archives have typically received email on CDs from a variety of systems, including MS Exchange, Novell Groupwise and Lotus Notes. One of the interesting outcomes of this work is software (an extension of the hmail software - see sourceforge) that enables ongoing capture of email, selected for archiving by users, from user systems. Email identified for archiving is normalised in an XML format and can be transformed to html for access. The software supports open email standards (POP3, SMTP, and IMAP4) as well as MySQL and MS SQL Server. The effort has been underway for five years and the software continues to be tested and refined.

CERP is a collaboration between the Smithsonian Institution Archives and Rockefeller Center Archives. This context has more in common with archiving email in the Bodleian context, where an email account is more likely to be accessioned from its owner in bulk than cumulatively. Ricc Ferrante gave an overview of the issues encountered, which were similar to our experiences on the Paradigm project and in working with creators more generally.

CERP has worked with EMCAP to publish an XML schema for preserving email accounts. Email is first normalised to mbox format and then converted to this XML standard using a prototype parser built in squeak smalltalk, which also has a web interface (seaside/comanche). The result of the transformation is a single XML file that represents an entire email account as per its original arrangement. Attachements can be embedded in the XML file, or externally referenced if on the larger side (over 25kb). If I remember rightly, the largest email account that has been processed so far is c. 1.5GB; we have one at the Library that's significantly larger and I'd like to see how the parser handles this. It will be interesting to compare the schema/parser with The National Archives of Australia's Xena. The developers are keen to receive comments on the schema, which is available here.