Thursday, 6 November 2008

Digital Preservation Policies Study published

JISC appointed Charles Beagrie to develop this Digital Preservation Policies Study back in March with a pretty small timeframe to deliver the goods. It's well worth a look if you're putting your own digital preservation policy together, especially if you're operating in the HE sector. The study provides a handy template of policy clauses that can be adapted for local needs. That should help people get started.

If I'm honest, it's the emphasis on context and mappings that I like best about the work (as an archivist I'm bound to like that, right? ;-)). By examining policy documents in the areas of research, teaching and learning, information, library and records management, the study has identified how digital preservation supports the work of universities. This alignment of digital preservation policy to the business of the University is critical to answering questions about why digital preservation matters. Anyone needing to make the case for digital preservation should take a look at the detailed mappings to these wider university policies in the appendices.

Friday, 24 October 2008

Slides from Digital Archives meeting

These are some slides from a talk given at a meeting on Digital Archives hosted by the Andrew W. Mellon Foundation back in September. They give an overview of what the futureArch project is about.

Thursday, 23 October 2008

iPres presentations now online

This year iPres was hosted by the British Library. Much food for thought, so much so that choosing between parallel sessions was something of a challenge. Good news then that the presentations and full papers are now available online.

Wednesday, 1 October 2008

Fun with tag clouds

Not the traditional form of indexing an archive, I know, but it seems to me that automagically extracted metadata formed into tag clouds would be a marvelous way of navigating through some digital archives.

We could present clouds at different levels of granularity - at the collection level, in series and lower levels all the way down to the item. We could even present clouds across multiple aggregations, be they of series, collections or items. This could be fun.
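To make the idea of clouds at different levels a little more concrete, here is a minimal sketch in Python. It assumes the extracted text for each item is already available as a string; the stop list, the collection layout and the function names are all invented for illustration, and a real system would need much better term extraction than a word count.

```python
import re
from collections import Counter

# Small illustrative stop list; a real one would be much longer.
STOP_WORDS = {"the", "and", "for", "our", "with", "that", "this"}

def term_counts(text):
    """Count lower-cased words of three or more letters, skipping stop words."""
    words = re.findall(r"[a-z]{3,}", text.lower())
    return Counter(w for w in words if w not in STOP_WORDS)

def cloud(counts, top=10, min_size=1, max_size=5):
    """Map the `top` most frequent terms to a size bucket from min_size to max_size."""
    common = counts.most_common(top)
    if not common:
        return {}
    highest = common[0][1]
    lowest = common[-1][1]
    spread = max(highest - lowest, 1)
    return {
        term: min_size + round((n - lowest) * (max_size - min_size) / spread)
        for term, n in common
    }

# Hypothetical archive: series name -> list of item texts.
collection = {
    "correspondence": ["email about the archive budget", "email about archive storage"],
    "reports": ["annual report on digital archive preservation"],
}

# Item-level counts roll up into series-level and collection-level counts.
series_counts = {
    name: sum((term_counts(item) for item in items), Counter())
    for name, items in collection.items()
}
collection_counts = sum(series_counts.values(), Counter())

series_clouds = {name: cloud(counts) for name, counts in series_counts.items()}
collection_cloud = cloud(collection_counts)
```

Because the counts are simply added together, the same approach would work for a cloud across several collections or an arbitrary selection of items.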

For some digital archives, I think tag clouds are probably a 'must'. Poorly structured and overly large email archives are a good candidate.

One of the downsides of the 'hybrid archive' is that we can't necessarily generate tag clouds that draw on all the contents of the archive. All 'physical' material and non-textual digital formats are excluded unless these things are already tagged by creators. They can, of course, be tagged later by cataloguers and/or users. I guess that we need to recognise that imbalance in our user interface, to help our users get to grips with the nature of research in a hybrid archive.

I know that automatic metadata extraction may have shortcomings, but I'd really like to see a fusing of standardised subject headings with tag clouds. We can have the best of both worlds, surely?

There have been lots of examples of tag clouds about recently, including TagCrowd and Wordle.

This is a TagCrowd entry for this blog...

Saturday, 27 September 2008

XML Schema for archiving email accounts

I attended several great sessions at the Society of American Archivists conference last month. There is a wiki for the conference, but very few of the presentations have been posted so far...

One session I particularly enjoyed addressed the archiving of email - 'Capturing the E-Tiger: New Tools for Email Preservation'. Archiving email is challenging for many reasons, which were very well put by the session speakers.

Both the EMCAP and CERP projects were introduced in the session.

EMCAP is a collaboration between state archives in North Carolina, Kentucky, and Pennsylvania to develop means to archive email. In the past, the archives have typically received email on CDs from a variety of systems, including MS Exchange, Novell GroupWise and Lotus Notes. One of the interesting outcomes of this work is software (an extension of the hMailServer software - see SourceForge) that enables ongoing capture of email, selected for archiving by users, from user systems. Email identified for archiving is normalised in an XML format and can be transformed to HTML for access. The software supports open email standards (POP3, SMTP, and IMAP4) as well as MySQL and MS SQL Server. The effort has been underway for five years and the software continues to be tested and refined.

CERP is a collaboration between the Smithsonian Institution Archives and the Rockefeller Archive Center. This context has more in common with archiving email in the Bodleian context, where an email account is more likely to be accessioned from its owner in bulk than cumulatively. Ricc Ferrante gave an overview of the issues encountered, which were similar to our experiences on the Paradigm project and in working with creators more generally.

CERP has worked with EMCAP to publish an XML schema for preserving email accounts. Email is first normalised to mbox format and then converted to this XML standard using a prototype parser built in Squeak Smalltalk, which also has a web interface (Seaside/Comanche). The result of the transformation is a single XML file that represents an entire email account as per its original arrangement. Attachments can be embedded in the XML file, or externally referenced if on the larger side (over 25 KB). If I remember rightly, the largest email account that has been processed so far is c. 1.5 GB; we have one at the Library that's significantly larger and I'd like to see how the parser handles this. It will be interesting to compare the schema/parser with The National Archives of Australia's Xena. The developers are keen to receive comments on the schema, which is available here.
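To give a feel for the shape of such a transformation, here is a rough sketch in Python rather than Smalltalk. It is not the CERP parser and does not follow the published schema: the element and attribute names, the `account_to_xml` function and the `attachments/` path are all invented for illustration. The only detail taken from the session is the idea of embedding small attachments and referencing larger ones externally, using the 25 KB threshold.

```python
import base64
import xml.etree.ElementTree as ET
from email.message import EmailMessage

INLINE_LIMIT = 25 * 1024  # bytes; mirrors the 25 KB threshold mentioned above

def account_to_xml(folder_name, messages, limit=INLINE_LIMIT):
    """Build one XML tree for a folder of messages.

    Element and attribute names are invented for illustration;
    they are not taken from the published schema.
    """
    account = ET.Element("Account")
    folder = ET.SubElement(account, "Folder", name=folder_name)
    for msg in messages:
        node = ET.SubElement(folder, "Message")
        for header in ("From", "To", "Subject", "Date"):
            if msg[header] is not None:
                ET.SubElement(node, header).text = str(msg[header])
        for part in msg.walk():
            filename = part.get_filename()
            if filename is None:
                continue  # containers and body parts are skipped in this sketch
            payload = part.get_payload(decode=True) or b""
            att = ET.SubElement(node, "Attachment", filename=filename)
            if len(payload) <= limit:
                att.set("storage", "embedded")
                att.text = base64.b64encode(payload).decode("ascii")
            else:
                # A real tool would also write the payload out to this location.
                att.set("storage", "external")
                att.set("href", "attachments/" + filename)
    return account

# Two made-up messages: one small attachment, one over the limit.
small = EmailMessage()
small["From"] = "a@example.org"
small["Subject"] = "Minutes"
small.set_content("Minutes attached.")
small.add_attachment(b"x" * 100, maintype="application",
                     subtype="octet-stream", filename="minutes.bin")

big = EmailMessage()
big["Subject"] = "Scan"
big.set_content("Scan attached.")
big.add_attachment(b"y" * (30 * 1024), maintype="application",
                   subtype="octet-stream", filename="scan.bin")

tree = account_to_xml("Inbox", [small, big])
xml_text = ET.tostring(tree, encoding="unicode")
```

For a real account, the messages yielded by `mailbox.mbox(path)` from the Python standard library could be passed in the same way. Message bodies and nested folders are left out here to keep the sketch short.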

Monday, 11 August 2008

Keeping the user experience in the browser

A few days back, news that Barack Obama is using Scribd prompted me to take another look at this document sharing site. I'm interested in user interfaces at the moment because developing interfaces for curators and researchers wanting to use hybrid archives will be an important part of futureArch's work.

I'm quite taken with iPaper's features. I should explain that iPaper is the Flash technology developed by Scribd to display documents published to the site by its users. Documents load quickly within the browser and the functionality is similar to that of Adobe Acrobat. You can search text, there's a thumbnail view, you can zoom, launch the document to display at full screen, turn the pages, etc. It also has some 'social' features (which may or may not be useful in this context), and it claims to be a more secure document format than PDF.

iPaper can display files encoded in a number of formats, a feature that may well prove useful for developing a browser-based interface to your typical born-digital archive. Imagine wanting to view a handful of items in a collection, each of which is in a different format. Rather than launching a different application to render each format (even if these are available as browser plug-ins), the user could access all the items using a single lightweight viewer that can be embedded in the web page itself. This normalisation for presentation simplifies access for repositories and users, and for those users interested primarily in content, a single viewer would provide a convenient and predictable experience that requires no software installations.

So far iPaper supports these formats:

* Adobe PDF (.pdf)
* Adobe PostScript (.ps)
* Microsoft Word (.doc/ .docx)
* Microsoft PowerPoint (.ppt/.pps/.pptx)
* Microsoft Excel (.xls/.xlsx)
* OpenOffice Text Document (.odt, .sxw)
* OpenOffice Presentation Document (.odp, .sxi)
* OpenOffice Spreadsheet (.ods, .sxc)
* All OpenDocument formats
* Plain text (.txt)
* Rich text format (.rtf)

For users interested in qualities beyond content, technologies such as iPaper may be less useful. One example is users who require an experience that reflects the context of creation; use of the iPaper format and viewer requires the transformation of the original item into iPaper format and its rendering in an environment quite different to that of its creation. Another example is users wishing to use specialist analytical tools, which might be domain- or format-specific.

There are some aspects of Scribd itself that appeal to me too. A collection overview pane sits next to a list of child items in the collection, each of which has a thumbnail, page-count, format indicator and brief abstract. I'm sure we could do something similar in a view for archival collections, though an archive repository would more likely point to series descriptions from the collection level, and to items at lower levels of the archive's hierarchy. Scribd's item-level view works well too: the document is displayed in the page (using the iPaper viewer) and a little metadata is available on the right - some tags, rights information (Creative Commons), relevant categories, etc., and since this is web 2.0, users are able to add their comments.

Other possibilities?
I've also been investigating another means of enabling browser-based delivery for the kinds of file formats found in a born-digital archive. More and more of us want to create and view data at the network level and we want to be able to do it with a variety of devices. This can only mean that more options are going to be available, but, as ever, the creative direction is unlikely to correlate exactly with our requirements.

One possibility is JavaScript lightboxes, so long as the usual accessibility issues are addressed. Many of these tools are designed for image galleries, but there are some with other functionality too. Highslide is one I've spent some time looking at, and now that version 4 is just out (five days old) it might be time to take another look. Perhaps the subject of a future post.

Thursday, 31 July 2008

DPC's preservation planning workshop

Earlier in the week I attended a DPC workshop on preservation planning, which was largely constructed of material coming out of the European project called Planets, which is now half-way through its four-year programme. There were also interesting contributions from Natalie Walters of the Wellcome Library and Matthew Woollard of the UK Data Archive.

A preservation system for the Wellcome Library?
Much of what Natalie had to say about the curation of born-digital archives chimed with our experiences here. Unlike us though, Wellcome are in the process of evaluating 'off the shelf' systems to manage digital preservation. They put out a tender earlier this year and received five responses that seem, in the main, to demonstrate a misunderstanding of archival requirements and the immaturity of the digital curation/preservation marketplace. One criticism was that the responses offered systems for 'access' or 'institutional repositories' (of the kind associated with open access HE content - academic papers and e-theses). This is something we also felt when we evaluated the Fedora and DSpace repositories on the Paradigm project (admittedly, this evaluation becomes a bit more obsolete day by day). Balancing access and preservation requirements has long been an issue for archivists, since we often have to preserve material that is embargoed for a period of time. I still believe that systems providing preservation services and systems providing researcher access are doing different things, but we do of course need some form of access to embargoed material for management and processing purposes. I also find the adoption of new meanings for words, like 'repository' and 'archive', tricky to negotiate at times. These issues aside, one of the systems offered seems to have held Wellcome's interest and I'll be keen to find out which one when this information can be revealed.

Preservation policy at UKDA
Matthew spoke about the evolution of preservation policy at the UKDA, which had no preservation policy until 2003 despite celebrating its 40th anniversary last year. The first two editions of the policy were more or less exclusively concerned with the technical aspects of preserving digital material, specifying such things as acceptable storage conditions and the frequency with which tape should be re-tensioned. The latest (third) edition embraces wider requirements including organisational/business need, user requirements (designated community and others), standards, legislation, technology and security. The new policy increases emphasis on data integrity and archival standards, defines archival packages more closely to provide for their verification, and pays attention to the curation of metadata describing the resources to be preserved.

If I understood correctly, the UKDA preserves datasets in their original form (SIP), migrates them to a neutral format (AIP1) and creates usable versions from the neutral format (AIP2). All these versions are preserved and dissemination versions of the dataset are created from AIP2. The degree of processing applied to a dataset is determined by applying a matrix which assigns a value on the basis of likely use and value. These processes feel similar to those evolving here, though we need to do more work to formalise them.

Matthew also showed us a nice little diagram from 1976, which was created to document UKDA workflow from initial acquisition of a dataset to its presentation to the final user. The fundamentals of professional archival, or OAIS-like, practice are evident. The UKDA's analysis of its own conformance with the OAIS model undertaken under the JISC 04/04 Programme is worth a look for those who haven't seen it.

Towards the end of the talk Matthew reminded us that having written a policy, one must implement it. It's not normally possible to implement every new thing in a policy at once, but the policy is valueless without mechanisms in place to audit it. Steps must be taken to progress those aspects of the policy that are new and to audit compliance more generally. The policy must also be available to relevant audiences who can evaluate the degree to which the archive complies with its own policy for themselves. I found this a very useful overview of the key issues involved in developing a preservation policy and the resulting policy itself is very clear and concise.

Planets tools for preservation planning
It's great to see the promise of Planets starting to be realised, especially since we plan to build on the project's work in relation to characterising material, planning and executing preservation strategies. Andreas Rauber kicked things off with an overview of the Planets project, which helped to demonstrate how the various components fit together. What is uncertain at the moment is how the software and services being developed by Planets will be sustained beyond the project's life. Neither is it clear what licensing model/s will be adopted for different components in the project, since there are the needs of commercial partners to consider as well as those of national archives, libraries and universities.

Christoph Becker gave us an overview of Plato, a tool which allows the user to develop preservation strategies for specific kinds of objects. In Plato, users can design experiments to determine the best available preservation strategy for a particular type of material. This involves a formal definition of constraints and objectives, which includes an assessment of the relative importance of each of these factors. Factors might include:

* object migration time - max. 1 second
* object migration cost - max. £0.05 per object
* preserve footnotes - 5
* preserve images - 5
* preserve headings - 4
* open format required - 5
* preserve font - 3
* and so on...

These are expressed in an 'objective tree', which can be created directly in Plato or in the Freemind mind mapping tool and uploaded to Plato. Objective trees can be very simple, but the process of creating a good and detailed objective tree is quite demanding (we had a go at doing this ourselves in the afternoon). In future we should be able to build on previous objective trees as these are developed and that will ease the process. For the moment the templates provided are minimal because the Plato team don't want to preempt user requirements!

The user must also supply a sample of material which can be used to assess the effectiveness of different strategies. This should be the bare minimum of objects required to represent the range of factors expressed in the objective tree. The user then selects different strategies to apply to the sample material, sets the experiment in motion, and compares the results against the objective tree. The process of evaluating results is manual at present, but there are plans to begin automating aspects of this too. Once the evaluation is complete, Plato can produce a report of the experiment which should demonstrate why one preservation strategy was chosen over another in respect of a particular class of material.
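As a way of thinking through how results might be compared against an objective tree, here is a minimal sketch in Python. It is not how Plato itself works: I'm assuming that hard limits knock a strategy out and that the remaining objectives are combined as a simple weighted average, and the strategy names and measurements are invented.

```python
# Hard limits: a strategy that exceeds one is rejected outright.
CONSTRAINTS = {"migration_seconds": 1.0, "migration_cost_gbp": 0.05}

# Importance weights on the 1-5 scale used in the example factors above.
WEIGHTS = {
    "preserve_footnotes": 5,
    "preserve_images": 5,
    "preserve_headings": 4,
    "open_format": 5,
    "preserve_font": 3,
}

def score(result):
    """Return a 0-1 weighted score, or None if a hard limit is exceeded.

    `result` maps constraint names to measured values and objective
    names to a 0-1 judgement of how well each objective was met.
    """
    for name, limit in CONSTRAINTS.items():
        if result[name] > limit:
            return None
    total = sum(WEIGHTS.values())
    return sum(w * result[name] for name, w in WEIGHTS.items()) / total

# Invented measurements for three hypothetical strategies.
experiments = {
    "strategy_a": {
        "migration_seconds": 0.4, "migration_cost_gbp": 0.01,
        "preserve_footnotes": 1.0, "preserve_images": 1.0,
        "preserve_headings": 1.0, "open_format": 1.0, "preserve_font": 0.5,
    },
    "strategy_b": {
        "migration_seconds": 0.2, "migration_cost_gbp": 0.02,
        "preserve_footnotes": 0.5, "preserve_images": 1.0,
        "preserve_headings": 0.5, "open_format": 1.0, "preserve_font": 1.0,
    },
    "strategy_c": {
        "migration_seconds": 3.0, "migration_cost_gbp": 0.01,
        "preserve_footnotes": 1.0, "preserve_images": 1.0,
        "preserve_headings": 1.0, "open_format": 1.0, "preserve_font": 1.0,
    },
}

scores = {name: score(result) for name, result in experiments.items()}
# Rank only the strategies that passed every hard limit, best first.
ranked = sorted(
    (name for name, s in scores.items() if s is not None),
    key=scores.get,
    reverse=True,
)
```

Here the third strategy does best on every objective but is dropped because it takes too long per object, which is the kind of trade-off the formal definition of constraints is meant to expose.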

Plato is available for offline use, which will be necessary for us when processing embargoed material, but it is also offered as an online service where users can perform experiments in one place and benefit from working with the results of experiments performed by others.

The Planets work on characterisation was introduced by Manfred Thaller. This work develops two formal characterisation languages - the extensible characterisation extraction language (XCEL) and the extensible characterisation description language (XCDL). The work should make it possible to determine more automatically whether a preservation action, such as migration, has preserved an object's essential characteristics (or significant properties). It is expected that the Microsoft family of formats, PDF formats and common image formats will be treated before the end of the project.

One of the interesting aspects of the characterisation work is developing an understanding of what is preserved or not in a particular process and how a file format impacts on this. Thaller demonstrated this (using a little tool for *shooting* files) by deliberately causing a small amount of damage to a png file and a tif file. A small amount of damage to the png file had severe consequences for its rendering, while the tif file could be damaged much more extensively and still retain some of its informational value. Thaller also used the example of migrating a MS Word 2003 document to the Open Document Text format. The migration to ODT seemed to lose a footnote in the document. Thaller then showed the same MS Word 2003 document migrated to PDF, where the footnote appears to be retained. In actual fact the footnote isn't lost in the migration to ODT, it's just not rendered. On the other hand, the footnote is structurally lost in the PDF file, but visually present. Thaller is proposing a solution which allows structure and appearance to be preserved.
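The file-shooting demonstration is easy to imitate in a few lines of Python. This is my own sketch, not Thaller's tool: it inverts a handful of bytes and compares what survives in an uncompressed byte stream with what survives in a zlib-compressed copy of the same content. PNG image data is deflate-compressed, so the compressed case is loosely analogous, but that analogy and all the names below are my assumptions rather than anything presented at the workshop.

```python
import random
import zlib

def shoot(data, hits, seed=0):
    """Return a copy of `data` with `hits` distinct bytes inverted."""
    rng = random.Random(seed)
    damaged = bytearray(data)
    for pos in rng.sample(range(len(data)), hits):
        damaged[pos] ^= 0xFF
    return bytes(damaged)

def surviving_fraction(original, damaged):
    """Fraction of byte positions that still match the original."""
    same = sum(a == b for a, b in zip(original, damaged))
    return same / len(original)

def try_inflate(blob):
    """Decompress `blob`, returning None if zlib rejects it."""
    try:
        return zlib.decompress(blob)
    except zlib.error:
        return None

# Made-up content: 500 short records, 6000 bytes in total.
content = b"".join(b"record %04d\n" % i for i in range(500))
packed = zlib.compress(content)

# Uncompressed: damage stays local, so almost everything survives.
raw_survival = surviving_fraction(content, shoot(content, 5))

# Compressed: the same number of hits usually makes the whole stream
# unreadable, in which case this is None.
packed_result = try_inflate(shoot(packed, 5))
```

In the uncompressed case five hits leave all but five bytes intact. In the compressed case I would expect `packed_result` to be None in most runs, since a damaged deflate stream generally fails to decode or fails its checksum.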

The final element of Planets on show was the testbed developed at HATII, demonstrated by Matthew Barr. The testbed looks very useful and, like Plato, will be available for use online and offline. There did seem to be some overlap in aims and functionality with Plato, but there are differences too. Its essential objectives seem similar - users should be able to perform experiments with select data and tools, evaluate those experiments and draw conclusions to inform their preservation strategy; the testbed will also allow tools and services to be benchmarked. It struck me that the process of conducting an experiment was simpler than with Plato, since a granular expression of objectives is not necessary. It's more quick and dirty, which may suit some scenarios better, but will the result be as good? Aspects I found particularly interesting were the development of a corpus and the ability to add new services (tools are deployed and accessed using web services) for testing.

Monday, 21 July 2008

Seeking a Software Engineer

We are looking for a Software Engineer to work on the futureArch project. You can read the job advertisement at the vacancies section of the University of Oxford's website; there is also a link to the further particulars from here. Closing date is 29 August 2008.

Wednesday, 16 July 2008

Annotating sound and video

At the JISC innovation forum earlier this week, I was fortunate enough to run into an impromptu demo of Synote by Mike Wald of ECS, who had hijacked the British Library's sound archive project stand. Well, perhaps 'hijack' is a little strong - the BL demo was pretty much done and Peter Findlay was happy to tune in to what Mike was showing.

Synote is a rather nifty tool which lets users add annotations to specific points in a digital sound or video recording. These annotations might be notes, tags, or images; they act like bookmarks - they can be returned to easily as and when the need arises. Synote uses a transcript of the audio, which can be generated by speech recognition software if the audio is clean enough, or compiled by hand if not. The transcript plays alongside the content, and the users' annotations are highlighted in it; clicking on a word in the transcript allows the user to skip ahead to the bookmark and, of course, the transcript is searchable. It's been designed as a teaching and learning tool, but I think it has a lot of possibilities as a means of interacting with audio and video content present in archival collections. The project has a SourceForge page, so hopefully we'll be able to have a go ourselves in due course.
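I haven't seen how Synote stores any of this, but the basic idea of tying annotations and transcript words to points in time can be sketched quite simply. Everything below (the `Annotation` class, the sample transcript and the function names) is hypothetical and only meant to show the kind of lookups involved.

```python
from bisect import bisect_right
from dataclasses import dataclass

@dataclass
class Annotation:
    seconds: float  # position in the recording
    kind: str       # "note", "tag" or "image"
    body: str

# Hypothetical transcript: (start time in seconds, word), sorted by time.
transcript = [
    (0.0, "welcome"), (0.6, "to"), (0.8, "the"), (1.0, "sound"), (1.5, "archive"),
]
annotations = [
    Annotation(1.0, "tag", "audio"),
    Annotation(1.5, "note", "which archive?"),
]

starts = [t for t, _ in transcript]

def word_at(seconds):
    """Word being spoken at `seconds` (the last word starting at or before it)."""
    i = bisect_right(starts, seconds) - 1
    return transcript[max(i, 0)][1]

def seek_time(word_index):
    """Playback position to jump to when a transcript word is clicked."""
    return transcript[word_index][0]

def search(term):
    """Start times of every transcript word matching `term`."""
    return [t for t, w in transcript if w == term.lower()]

# Words to highlight in the transcript because an annotation points at them.
highlighted = [word_at(a.seconds) for a in annotations]
```

Keeping the transcript sorted by start time means both directions of the lookup, from a click on a word to a playback position and from a playback position back to a word, stay cheap even for long recordings.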

Wednesday, 2 July 2008


Hello world, as they say. Welcome to this new blog, which is a place for those of us working with born-digital archives at the Bodleian Library to share our thoughts, frustrations and successes. We'll also be making a note of interesting or useful things we stumble upon.

We've been working on issues relating to the long-term preservation of digital archives for a few years now. If you take a look at our Paradigm and Cairo projects, that should give you an idea of the kinds of issues we're dealing with.

This blog is being born as we launch an important phase of development at the Library. We're about to begin the futureArch project, which will see us move the curation of born-digital archives and manuscripts from a series of small projects to a sustainable activity integrated with other aspects of the Library's operations. When futureArch concludes, in just over three years' time, we aim to have embedded the curation of born-digital archives and manuscripts into the way we do things.

That's probably more than enough for a first post...