Tuesday, 17 November 2009

building castles 1: the problem

It has been an odd couple of days. You know how it is. A problem that needs solving. A seemingly bewildering array of possible solutions and lots of opinions and no clear place to start. In an attempt to bring some shape to the mist, I'm going to start at the start, with the basics.

The Raw Materials
  • A collection of things.
  • A set of born digital items - mostly documents in antique formats.
  • EAD for the collection - hierarchical according to local custom and ISAD(G).
  • A spreadsheet - providing additional information about the digital items, including digests.
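
Since the spreadsheet carries digests, one early sanity check is to confirm that each file still matches its recorded value. This is only a minimal sketch: it assumes the spreadsheet has been exported to CSV, that the columns are called `filename` and `digest` (both names are hypothetical), and that the digests are SHA-256, which the spreadsheet may or may not use.

```python
import csv
import hashlib
import io
from pathlib import Path


def file_digest(path, algorithm="sha256"):
    """Hash a file in chunks so large items need not fit in memory."""
    h = hashlib.new(algorithm)
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()


def check_items(csv_text, root, algorithm="sha256"):
    """Return {filename: True/False} for each row of the exported spreadsheet.

    The column names "filename" and "digest" are assumptions for this sketch.
    """
    results = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        actual = file_digest(Path(root) / row["filename"], algorithm)
        # Compare case-insensitively: spreadsheets often upper-case hex strings.
        results[row["filename"]] = actual == row["digest"].strip().lower()
    return results
```

Passing a different `algorithm` (for example "md5") would cover other digest types, as long as `hashlib` supports it.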
The Desired Result

A browser-based reader interface to the digital items that maintains their connections to the analogue components, remains faithful to the structure of the finding aid, and presents that structure in a way that does not confuse the reader. Ideally the interface should also support aspects of a collaborative Web, where people can annotate and comment, and offer "basket"-like functionality ("basket" is the wrong term): maybe requests for copies, and maybe even the ability to arrange the collection how they'd like to use it.

(I imagine you've all got similar issues! :-))

We put together a sketch of the interface to the collection for the Project Advisory Board and got some very useful feedback from that. Our Graduate Trainee Victoria has also done some great research on the interfaces of existing archives and some commercial sites, which gives us some marvellous input on what we should and could build.

But this is where things get misty...

We have some raw materials, we have a vision of the thing we want to build (though that vision is in parts hazy and in parts aiming high! (why not eh?)), so where do we go from here?

(To put it another way, there are the foundations of a "model", a vision of a "view"; now we need to define the "controller" - the thing that brings the first two together).

  • We could build a database and put all the metadata into it and run the site off that.

  • We could build a set of resources (the items, the sub[0,*]series, the collection, the people), link all that data together and run the site off that.

  • We could build a bunch of flat pages which, while generated dynamically once, don't change once the collection is up.
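
To make the middle option a little more concrete, here is a rough sketch of what "a set of resources linked together" might look like in memory. Everything in it is hypothetical (the `Resource` class, the identifiers, the titles); it only illustrates that a series can nest to any depth and that each item can find its way back up to the collection.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Resource:
    """One addressable thing: the collection, a (sub)series, or an item."""
    ident: str
    title: str
    kind: str  # "collection", "series" or "item"
    # Excluded from repr/compare so the parent <-> child links do not recurse.
    parent: Optional["Resource"] = field(default=None, repr=False, compare=False)
    children: List["Resource"] = field(default_factory=list)

    def add(self, child):
        """Attach a child resource and return it, so calls can be chained."""
        child.parent = self
        self.children.append(child)
        return child

    def path(self):
        """Titles from the collection down to this resource."""
        node, titles = self, []
        while node is not None:
            titles.append(node.title)
            node = node.parent
        return list(reversed(titles))


# Made-up example data: a collection, a series, a sub-series and one item.
collection = Resource("c1", "Example Papers", "collection")
series = collection.add(Resource("c1/s1", "Correspondence", "series"))
subseries = series.add(Resource("c1/s1/ss1", "Letters", "series"))
item = subseries.add(Resource("c1/s1/ss1/i1", "letter.wpd", "item"))

print(" > ".join(item.path()))
```

The `path()` walk is the sort of thing a page template could use to show where an item sits in the finding aid.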

There is a strong contender for how it'll be done (the middle one!) and in the next exciting episode I'll hopefully be able to tell you more about the first tentative steps, but for now I'm open to suggestions - either for alternatives or technologies that'll help and if you have already built what we're after then please get in touch... ;-)


6 comments:

jjsomewhere said...

Convert all the born digital antique documents to PDF (but keep, and serve, the originals too). Set up a server running Fedora [http://www.fedora-commons.org/] then load all your files and data into that. Build an HTTP-based public API for your Fedora repository coded in Python, running on Django and serving EAD/ISAD/others wrapped in XML and supporting the functionality you need, inc. storing user contributions & feedback. Use Python/Django to then also build an XHTML wrapper around the API for your web-based user interface. Publish your API. Done.

pixelatedpete said...

Nice idea - thanks! Was all sounding plausible until you mentioned Python... ;-)

jjsomewhere said...

Substitute your language/application server of choice in place of Python/Django. But you're right, this isn't a solution that's viable without writing some code somewhere along the line, as Fedora has no user interface of its own.

Caveat. I'm planning a similar approach on a project here in the UK in the near future, the suggested combination is based on research rather than first hand experience (yet).

pixelatedpete said...

More seriously, I am curious to know what I gain using Fedora in your outline as opposed to, say, a Web server?

jjsomewhere said...

A web server simply serves content; it doesn't manage it. Some web application servers do both, but not usually in a way specific to the needs of archives. Fedora does. So does DSpace [http://www.dspace.org/], which might be an alternative.

More at:
http://www.fedora-commons.org/about/features

Fedora would be capable of handling the storage, metadata, search and administration side of your problem. After some setup work your code would only have to 'talk' to Fedora via its API and tell it what to do. It would require dramatically less code than building from scratch, and the end result should be much more stable, manageable and extensible long term.

Specific to your issues, Fedora should be able to maintain the links between your analogue and digital data more easily (than, say, building the same thing with a relational database such as MySQL) because it would allow the construction of a custom data structure within Fedora using RDF that mirrors the structure of your analogue data. Fedora calls this its 'Content Model Architecture'.

It also defines relationships between data more flexibly because it can store data as triples (the subject–predicate–object structure used by RDF and the rather stillborn semantic web), and searches are defined using SPARQL, which performs a similar function to SQL but is designed for searching by complex relationships. A bit like a graph database. As the actual documents can be stored in Fedora, maintaining data integrity across files stored in a file system wouldn't be an issue.
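
For anyone who hasn't met triples before, the idea can be sketched in a few lines of plain Python. This is not Fedora's API or a real SPARQL engine, just a toy pattern matcher; the `partOf` and `format` predicates and all the identifiers are invented for illustration.

```python
def match(triples, pattern):
    """Yield variable bindings for one (s, p, o) pattern.

    Terms starting with '?' are variables; anything else must match exactly.
    """
    for triple in triples:
        binding = {}
        for want, got in zip(pattern, triple):
            if want.startswith("?"):
                # A variable repeated within one pattern must bind consistently.
                if binding.setdefault(want, got) != got:
                    break
            elif want != got:
                break
        else:
            yield binding


def query(triples, patterns):
    """Join several patterns, roughly what a SPARQL basic graph pattern does."""
    solutions = [{}]
    for pattern in patterns:
        next_solutions = []
        for solution in solutions:
            # Substitute variables that are already bound, then match the rest.
            bound = tuple(solution.get(term, term) for term in pattern)
            for binding in match(triples, bound):
                next_solutions.append({**solution, **binding})
        solutions = next_solutions
    return solutions


# Made-up data: two items in a series, and the series in a collection.
TRIPLES = [
    ("item:1", "partOf", "series:a"),
    ("item:2", "partOf", "series:a"),
    ("series:a", "partOf", "collection:x"),
    ("item:1", "format", "WordPerfect"),
]

# "Which things sit two levels below collection:x?"
rows = query(TRIPLES, [
    ("?item", "partOf", "?series"),
    ("?series", "partOf", "collection:x"),
])
print(sorted(row["?item"] for row in rows))
```

A relationship that takes a self-join in SQL is just a second pattern sharing a variable here, which is the flexibility being described.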

I'm sure there are also advantages in terms of data consistency checking, backup and repository restore after data corruption and a load of other stuff that I don't know about yet.

On the other hand Fedora is non-trivial to learn (I'm only just starting), requires a good deal of custom code, and might be overkill if your needs are very simple and unlikely to expand later. At least that's my take, as a prospective new recruit to the Fedora world.

Jeremy.

pixelatedpete said...

Thanks for your comment - it's nice to have active discussion on the blog! :-)

It sounds like you're building a repository - management as well as serving the content. I wasn't clear in my statement of the problem, but essentially what I'm after (for the moment) is an interface to copies of the items. The real things are managed elsewhere - and thus when you say "A web server simply serves content, it doesn't manage it" it is OK, because management is happening outside of this less ambitious application - that is why I asked about Fedora. In this instance I suspect just some way of addressing and resolving the content is enough.

I agree regards RDF & triples - this is essentially the "middle one" - each "resource" being RDF. I've already got a script that builds a bunch of RDF resources out of the EAD (for each file, component and the collection) and the next bit is to shove that into a triple-store and run some queries to see if I've done it right! :-)
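
The actual script isn't shown here, but the general shape of an EAD-to-triples pass might look something like the sketch below. It assumes un-namespaced EAD 2002 with `<c>` or `<c01>`..`<c12>` components, each carrying a `<did>` with a `<unitid>` and `<unittitle>`; the base URI and the `level`, `title` and `partOf` predicates are placeholders rather than a real vocabulary.

```python
import re
import xml.etree.ElementTree as ET

COMPONENT = re.compile(r"^c(\d\d)?$")  # matches <c> and <c01>..<c12>


def ead_to_triples(ead_xml, base="http://example.org/"):
    """Walk the archdesc and its components, emitting (s, p, o) tuples."""
    root = ET.fromstring(ead_xml)
    archdesc = root if root.tag == "archdesc" else root.find("archdesc")
    triples = []

    def walk(element, parent_uri):
        did = element.find("did")
        if did is None:  # nothing to identify this component by; skip it
            return
        uri = base + did.findtext("unitid", default="").strip()
        triples.append((uri, "level", element.get("level", "")))
        triples.append((uri, "title", did.findtext("unittitle", default="").strip()))
        if parent_uri is not None:
            triples.append((uri, "partOf", parent_uri))
        # Components hang off <dsc> at the top level, and nest directly below.
        container = element.find("dsc") if element.tag == "archdesc" else element
        for child in (container if container is not None else []):
            if COMPONENT.match(child.tag):
                walk(child, uri)

    if archdesc is not None:
        walk(archdesc, None)
    return triples


# A tiny made-up finding aid: collection > series > file.
SAMPLE = """<ead><archdesc level="collection">
  <did><unitid>coll</unitid><unittitle>Example Papers</unittitle></did>
  <dsc>
    <c01 level="series">
      <did><unitid>coll/1</unitid><unittitle>Correspondence</unittitle></did>
      <c02 level="file">
        <did><unitid>coll/1/1</unitid><unittitle>Letters</unittitle></did>
      </c02>
    </c01>
  </dsc>
</archdesc></ead>"""

for triple in ead_to_triples(SAMPLE):
    print(triple)
```

The resulting tuples could then be serialised as RDF and loaded into whichever triple-store is chosen, with the `partOf` links preserving the hierarchy of the finding aid.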

For what it is worth, I've experience of both DSpace & Fedora... :-)