Friday 26 November 2010

What I learned from the word clouds...

Now, word clouds are probably bit out of fashion these days. Like a Google Map, they just seem shiny but most of the time quite useless. Still, that hasn't stopped us trying them out in the interface - because I'm curious to see what interesting (and simple to gather) metadata n-grams & their frequency can suggest.

Take for instance the text of "Folk-Lore and Legends of Scotland" [from Project Gutenberg] (I'm probably not allowed to publish stuff from a real collection here and choose this text because I'm pining for the mountains). It generates a "bi-gram"-based word cloud that looks like this:

Names (of both people and places) quickly become obvious to human readers, as do some subjects ("haunted ships" is my favourite). To make it more useful to machines, I'm pretty sure someone has already tried cross-referencing bi-grams with name authority files. I also imagine someone has used the bi-grams as facets. Theoretically a bi-gram like "Winston Churchill" may well turn up in manuscripts from multiple collections. (Any one know of any successes doing these things?).

Still, for now I'll probably just add the word clouds of the full-texts to the interface, including a "summary" of a shelfmark, and then see what happens!

I made the (very simple) Java code available on GitHub, but I take no credit for it! It is simply a Java reworking of Jim Bumgardner's word cloud article using Jonathan Feinberg's tokenizer (part of Wordle).

No comments: