Monday, 27 April 2009

Wahcade Emulator Front-End

Now, you might think I've gone mad putting a link here to Wahcade. Either that or you'll think I've too much time on my hands and spend all my time digging out old games to play on this arcade machine manager. While making an arcade machine, case and all, sounds like a lot of fun (and one day when life is less busy I might just give it a go), I'm really making a note here to flag it as an interesting example of what we might do in reading rooms. I guess it is a bit like those media centre PCs (Mythbuntu for example) or the BL's "Turning the Pages" only this is a front-end for emulators.

Imagine that a collection is a "rom" (the rom being the image of a chip containing, in Wahcade's case, a game but for us could be a disk image from a donor's PC). The user picks from a list of roms and then the reading room "arcade" starts an emulator and away you go. Before you know it the dumb terminal is a replica Mac System 6 desktop complete with donor file system, etc.
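To make the idea a little more concrete, here is a minimal sketch of the "pick a collection, start an emulator" step. Everything in it is an assumption for illustration: the `Collection` record, the catalogue entries, the emulator names and the image paths are all made up, and it has nothing to do with how Wahcade itself is built.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class Collection:
    """One entry in the reading-room menu (all values here are invented)."""
    title: str
    emulator: str      # name of the emulator executable (hypothetical)
    disk_image: Path   # disk image taken from the donor's machine


# Hypothetical catalogue; a real one would come from the archive's own records.
CATALOGUE = [
    Collection("Donor A, Mac System 6 desktop", "example-mac-emulator",
               Path("images/donor_a.dsk")),
    Collection("Donor B, DOS laptop", "example-pc-emulator",
               Path("images/donor_b.img")),
]


def menu(catalogue):
    """Return the numbered list a reader would pick from."""
    return [f"{i}. {c.title}" for i, c in enumerate(catalogue, start=1)]


def launch_command(catalogue, choice):
    """Build the command line for the chosen collection (1-based choice)."""
    if not 1 <= choice <= len(catalogue):
        raise ValueError(f"no collection numbered {choice}")
    picked = catalogue[choice - 1]
    return [picked.emulator, str(picked.disk_image)]
```

The list returned by `launch_command` is in the shape that Python's `subprocess.run` accepts, so the front-end would only need to hand it over once the reader has made a choice.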

Be neat, wouldn't it?

Friday, 17 April 2009

Terabyte Terror

I knew I'd come to the right place for work one morning when I was talking with Renhart and Susan and mentioned that I was very excited at having discovered a Maplin just minutes' walk from the office. Instead of making a hasty retreat from the conversation, both of them gave me knowing smiles and agreed that Maplin was wonderful!

From my early days watching my Dad teach electronics I've loved the smell of soldering, the look of components and the idea that you can make your own set of LEDs flash just for the fun of it. Thumbing through the Maplin catalogue with a cup of tea was once one of my favourite pastimes. But these days, more and more, I get a sense of dread as I check out the special offers.


Let me give you an example: 1TB External Drive, £99.
Here is another: 1TB Internal Drive, £89.

You read that right - 1 terabyte of storage for under £100! Doesn't that make you quake? Probably not, but I can't help but wonder how long it will be before we have to accession a 1TB drive. What do we do with it? Do we even know what that amount of detritus looks like, accumulated over, well, how long? A lifetime? A couple of evenings with iTunes? We don't know how long it'll take the average person to fill up a 1TB drive. Do we have the capacity to store 1TB of data and, even if we do, how sustainable is that?

You could argue that since storage like this is so cheap, we can rest assured that our own storage costs will fall too, so we can always keep up with the growth of consumer storage. It is a fair point, but how many preservation-grade storage devices can manage 10p a GB? None, I imagine, and for good reason. There is a whole lot more to a preservation system than a disk and a plastic case - it takes more than 1TB to keep 1TB safe for a start! (Mind you, I couldn't help but smile at Maplin's promise of "Peace of mind with 5 year limited warranty").

If we cannot keep up with the storage then, what do we do? A brute force method would be to compress the data, but then bit rot becomes a much more worrying issue (and it is pretty worrying already). We could look for duplicates - how many MP3 collections will include the same songs, for instance, and should we keep them all (if any)? What if it is the same song, with a different encoding/bitrate/whatever? What about copies of OSs - all those i386 directories? (Though arguably an external drive will not contain an OS, so we won't save space there).
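For what it's worth, the simplest kind of duplicate hunting is easy to sketch: hash every file and group the ones whose checksums match. The function names and the choice of SHA-256 below are my own assumptions for illustration, not a description of any tool we actually use.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large files need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(root):
    """Map each checksum to the files sharing it, keeping only real duplicates."""
    by_digest = defaultdict(list)
    for path in sorted(Path(root).rglob("*")):
        if path.is_file():
            by_digest[sha256_of(path)].append(path)
    return {d: paths for d, paths in by_digest.items() if len(paths) > 1}
```

Note that this only catches byte-for-byte copies. The same song at a different bitrate or in a different encoding produces a different checksum, so that harder question stays open.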

We probably don't need or want to keep all of those 1000GBs, but how will we identify what to preserve? Susan and Renhart came up with some answers to this with their brilliant Paradigm project - which I'll paraphrase as "encourage the creators to curate their own data" - and I'm hopeful that will happen, but what if it doesn't? Will we see "personal data curation" and "managing information overload" added to the National Curriculum anytime soon? I hope so!

All of which finally gives me reason to stop worrying about cheap terabytes! Data is going to keep growing and someone is going to have to help manage all that stuff. I guess that is where we fit in.

Monday, 6 April 2009

Validating normalised dates in XML

I had some fun (hmm..., maybe that's not the right word) a year or two ago with regular expressions, trying to come up with something that could validate the kinds of 'normalised' dates that archivists use. You know the ones. The fuzzy dates, the approximates, the uncertains, the 'it was in this decade, but I can't be more precise than that' date, the 'I can tell you the start-date, but not the end-date' (and vice-versa) date. To add to this, we now have the very precise dates associated with born-digital materials - down to the second, complete with timezone. In the event, my problem was dispatched by the folks working on PREMIS, who created a union type that brings together some regular expressions to provide a fix (not perfect, but that's regular expressions for you). Just recently the Library of Congress have mounted some pages in the Standards section of their website, where they have put together a nice statement of the problem, as well as publishing the union type and an XML document with some test dates. See their Extended Date Time Format page.
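To give a flavour of the problem, here is a much simplified sketch of the regular expression approach. To be clear, this is not the PREMIS union type and not the Library of Congress format. The conventions it accepts (a trailing `?` for an uncertain date, and a `/` range where either end may be left empty) are assumptions I've picked purely for illustration.

```python
import re

# Building blocks: year, month, day, time to the second, and a timezone.
_YEAR = r"[0-9]{4}"
_MONTH = r"(?:0[1-9]|1[0-2])"
_DAY = r"(?:0[1-9]|[12][0-9]|3[01])"
_TIME = r"(?:[01][0-9]|2[0-3]):[0-5][0-9]:[0-5][0-9]"
_ZONE = r"(?:Z|[+-](?:[01][0-9]|2[0-3]):[0-5][0-9])"

# A single date at year, month, day or second precision; '?' marks it uncertain.
_POINT = rf"{_YEAR}(?:-{_MONTH}(?:-{_DAY}(?:T{_TIME}{_ZONE})?)?)?\??"

# A range with a known start, a known end, or both (but never a bare '/').
_RANGE = rf"(?:{_POINT}/(?:{_POINT})?|/{_POINT})"

PATTERN = re.compile(rf"(?:{_POINT}|{_RANGE})")


def is_normalised_date(value):
    """True if the whole string fits one of the shapes sketched above."""
    return PATTERN.fullmatch(value) is not None
```

With this, strings such as `1950`, `1950-06`, `1950?`, `1950/`, `/1959` and `2009-04-06T09:30:00+01:00` pass, while `1950-13` and `6 April 2009` do not. It also shows the "not perfect" part nicely: `1950-02-31` passes, because a pattern like this checks the shape of a date rather than the calendar.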

Friday, 3 April 2009

Draft data dictionary and schema for document significant properties

A data dictionary and related schema have been drafted for those documents that are largely text, but where creators can specify formatting, such as fonts, colours, text size and page layout; where they can embed images and other items; and where they might take advantage of application features, such as the ability to create annotations or page thumbnails. Specifically targeted formats are: OpenDocument Text, PDF, StarOffice, MS Works, MS Word and WordPerfect. Significant properties relating to appearance, behaviour, content and structure are recorded, and it's anticipated that this metadata could be plugged into PREMIS 2.0's objectCharacteristicsExtension.

The designers, from the California Digital Library and Harvard University Library, are seeking comments from the digital preservation community. Semantic units are: PageCount, WordCount, CharacterCount, ParagraphCount, LineCount, TableCount, GraphicsCount, Language, Fonts, FontName, IsEmbedded, Features. You can see the current schema in full at

This looks like a useful addition to preservation metadata, provided tool support for extracting the information and populating metadata records follows. I think the list of values for 'Features' - isTagged, hasLayers, hasTransparancy, hasOutline, hasThumbnails, hasAttachments, hasForms, hasAnnotations - may need extending (hasFootnotes, hasEndnotes?), and it would be good to see some definitions and examples of the existing values.

I wonder if we need a different data dictionary and schema for slideshows? This one might be adequate with some additions to cover things like animations, timings, etc. Seeing this data dictionary also reminds me that we need to look at where the Planets folk are up to on their significant properties work (XCDL/XCEL).

Thursday, 2 April 2009

Digital preservation for individuals and small organisations

Hoppla is a prototype archiving toolkit, with in-built digital preservation capacity; it's not yet available to test, but it sounds very promising. It's being designed specifically for home and small office users by developers at the Institute of Software Technology and Interactive Systems at Vienna University of Technology. It supports preservation at the bit-stream level and includes functions for managing format obsolescence through migration pathways (don't know which objects/pathways). The system also records metadata about preservation actions and object characteristics. If this is reliable, unobtrusive and low maintenance, perhaps we could roll it out to some of the creators the Library works with. It's possible that the acquisition modules could be useful to accessioning archivists too.