Friday, 26 November 2010

What I learned from the word clouds...

Now, word clouds are probably a bit out of fashion these days. Like a Google Map, they seem shiny but, most of the time, quite useless. Still, that hasn't stopped us trying them out in the interface - because I'm curious to see what interesting (and simple-to-gather) metadata n-grams & their frequencies can suggest.

Take, for instance, the text of "Folk-Lore and Legends of Scotland" [from Project Gutenberg] (I'm probably not allowed to publish stuff from a real collection here, and chose this text because I'm pining for the mountains). It generates a bi-gram-based word cloud that looks like this:

Names (of both people and places) quickly become obvious to human readers, as do some subjects ("haunted ships" is my favourite). To make it more useful to machines, I'm pretty sure someone has already tried cross-referencing bi-grams with name authority files. I also imagine someone has used bi-grams as facets. Theoretically, a bi-gram like "Winston Churchill" may well turn up in manuscripts from multiple collections. (Anyone know of any successes doing these things?)
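For the curious, the basic idea behind gathering bi-grams and their frequencies can be sketched in a few lines of Java. This is just an illustrative sketch with a naive whitespace-ish tokenizer, not the code from the repository (which uses Jonathan Feinberg's tokenizer):

```java
import java.util.HashMap;
import java.util.Map;

public class BigramCounter {

    // Count bi-gram frequencies in a text. Tokenization here is a crude
    // lowercase split on non-letters - the real thing should be smarter.
    public static Map<String, Integer> countBigrams(String text) {
        String[] words = text.toLowerCase().split("[^a-z']+");
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i + 1 < words.length; i++) {
            if (words[i].isEmpty() || words[i + 1].isEmpty()) continue;
            String bigram = words[i] + " " + words[i + 1];
            counts.merge(bigram, 1, Integer::sum);  // increment, starting at 1
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts =
            countBigrams("The haunted ships sailed past the haunted ships.");
        System.out.println(counts.get("haunted ships")); // prints 2
    }
}
```

Feed the whole full-text through something like this, sort by count, and the most frequent bi-grams (names, places, recurring subjects) float to the top for the cloud.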

Still, for now I'll probably just add the word clouds of the full texts to the interface, as a "summary" of each shelfmark, and then see what happens!

I made the (very simple) Java code available on GitHub, but I take no credit for it! It is simply a Java reworking of the code from Jim Bumgardner's word cloud article, using Jonathan Feinberg's tokenizer (part of Wordle).

Wednesday, 10 November 2010

The as yet unpaved publication pathway...

It has been a while since we had a whiteboard post, so I thought it was high time we had one! This delightful picture is the result of trying to explain the "Publication Pathway" - Susan's term for making our content available - to a new member of staff at the Library...

Nothing too startling here really - take some disparate sources of metadata, add a sprinkling of auto-gen'd metadata (using the marvellous FITS and the equally marvellous tools it wraps), migrate the arcane input formats to something useful, normalise and publish! (I'm thinking I might get "Normalise and Publish!" printed on a t-shirt! :-))

The blue box, CollectionBuilder, is what does most of the work: it constructs an in-memory tree of "components" from the EAD, tags the items onto the right shelfmarks, augments the items with additional metadata, and writes the whole lot out in a tidy directory structure that even includes a FOXML file with DC, PREMIS and RDF datastreams (the RDF is used to maintain the hierarchical relationships in the EAD). That all sounds a lot neater than it currently is but, like all computer software, it is a work in progress that works, rather than a perfect end result! :-)
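The component-tree idea is simple enough to sketch. The class and field names below are illustrative only (they are not the real CollectionBuilder internals), and the "write" step just prints an indented outline where the real code would write directories and FOXML:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// A minimal sketch of an in-memory "component" tree, mirroring the
// nested <c> components of an EAD finding aid.
public class Component {
    final String shelfmark;                              // e.g. an EAD unitid
    final Map<String, String> metadata = new HashMap<>(); // augmented metadata
    final List<Component> children = new ArrayList<>();

    Component(String shelfmark) { this.shelfmark = shelfmark; }

    void add(Component child) { children.add(child); }

    // Walk the tree depth-first; each component becomes one indented line.
    void write(String indent, StringBuilder out) {
        out.append(indent).append(shelfmark).append('\n');
        for (Component c : children) c.write(indent + "  ", out);
    }

    public static void main(String[] args) {
        Component fonds = new Component("MS. 100");
        Component series = new Component("MS. 100/1");
        series.add(new Component("MS. 100/1/1"));
        fonds.add(series);

        StringBuilder out = new StringBuilder();
        fonds.write("", out);
        System.out.print(out);
    }
}
```

The depth-first walk is also roughly where the hierarchical RDF relationships come from: each child knows its parent, so emitting an "is part of" triple per edge falls out of the same traversal.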

After that, we will (it ain't quite there yet) push the metadata parts into the Web interface, and from there index it and present it to our lovely readers!


The four boxes at the bottom are the "vhysical" layout - it's a new word I made up to describe what is essentially a physical (machine) architecture but is, in fact, a bunch of virtual machines...

For the really attentive among you, this shot is of the whiteboard in its new home on the 2nd floor of Osney One, where Renhart and I have moved following a fairly major building renovation. Clearly we were too naughty to remain with the archivists! ;-)