Blog/20091102 Feeding a repository

Interesting post on Phil’s JISC CETIS blog, on Feeding a repository.

1 Depositing media vs. harvesting metadata

Phil starts by saying:

So what is this feed-deposit idea. The first thing to be aware of is that as far as I can make out a lot of the people who talk about this don’t necessarily have the same idea of “repository” and “deposit” as I do. For example the Nottingham Xpert rapid innovation project and the Ensemble feed aggregator are both populated by feeds (you can also disseminate material through iTunesU this way). But, (I think) these are all links-only collections, so I would call them a catalogues not repositories, and I would say that they work by metadata harvest(*) not deposit.
[Figure: EnsembleOverview.gif — overview of the Ensemble feed aggregator]

The above-mentioned Ensemble work originated in the Steeple project. However, our original proposal for the Steeple Ensemble was to use a more 'repository'-like approach: the idea was to provide a 'library' (with a catalogue and 'books', i.e. media) rather than just a 'catalogue'. I do think that search sites (like Xpert) are useful, because they let you find media. If the search site is well done, you're just one click away from the media; if it's not, you still need several clicks to get there, which makes it much less useful.

Our view on this stems from using RSS/Atom to transport data+metadata between different places, e.g. between institutional video servers and web portals, so we've usually taken a "transport data+metadata" approach rather than a "harvest metadata and provide links" approach (see below for the finer details!).

Of course different approaches suit different scenarios, and an important factor is the granularity of the resource. If you have many small resources (like images or audio/video), it's a pain to have to find them across several sites, so bringing data and metadata together is useful.

2 How do you syndicate/deposit lots of data?

The trouble is, though, that once you get down to details there are several problems and several different ways of overcoming them. For example, how do you go beyond having a feed for just the last 10 resources? Putting everything into one feed doesn’t scale. If your content is broken down into manageable sized collections (e.g. The OU’s OpenLearn courses and I guess many other OER projects) you could put everything from each collection into a feed and then have an OPML file to say where all the different feeds are (which works up to a point, especially if the feeds will be fairly static, until your OPML file gets too large). Or you could have an API that allowed the receiver of the feed to specify how they wanted to chunk up the data: OpenSearch should be useful here, it might be worth looking at YouTube as an example.
[Figure: Master feed 1.png — structure of a chained 'master' (index) feed]

Within Steeple we've taken a slightly different approach and decided not to use OPML. Instead, we advocate using RSS or Atom for the 'index' as well, because OPML is weakly defined and its usage is not completely uniform.

One advantage of choosing RSS/Atom rather than OPML is greater 'symmetry': both the 'index' feed and the 'media' feeds have essentially the same structure, and can be validated and parsed in the same way.

Another advantage is that you can simply chain your feeds, i.e. have an 'index' feed for the institution that links to 'index' feeds for departments, and those in turn to 'media' feeds. The approach is therefore scalable.
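
To make the chaining concrete, here is a minimal sketch (in Python, using the third-party feedparser library) of how an aggregator might walk such a chain. The convention assumed here — that an entry pointing at another feed carries a link of type application/atom+xml or application/rss+xml, while a media entry carries enclosures — is illustrative, not part of any formal specification.

    import feedparser  # third-party: pip install feedparser

    def walk(feed_url, seen=None):
        # Recursively walk a chained 'index' feed down to its 'media' entries.
        # Guard against cycles, since nothing stops two index feeds from
        # pointing at each other.
        seen = seen if seen is not None else set()
        if feed_url in seen:
            return
        seen.add(feed_url)
        feed = feedparser.parse(feed_url)
        for entry in feed.entries:
            # An entry linking to another feed: descend into the sub-feed.
            for link in entry.get("links", []):
                if link.get("type") in ("application/atom+xml",
                                        "application/rss+xml"):
                    walk(link["href"], seen)
            # An entry with enclosures: an actual media item.
            for enc in entry.get("enclosures", []):
                print(entry.get("title", "(untitled)"), enc.get("href"))

Because the 'index' and 'media' feeds share one structure, the same parser handles every level, which is exactly the 'symmetry' advantage mentioned above.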

It's of course a rather static approach, but that's an advantage, because it makes it easy for the provider. The data+metadata can then be pulled into another service, that can provide a plethora of ways of accessing the information.

3 The flavours of RSS

Then there are similar choices to be made for how just about every piece of metadata and the content itself is expressed in the feed, starting with the choice of flavour(s) for RSS or ATOM feed.

There's an interesting continuum from just harvesting metadata to more concrete approaches.

  1. OAI-PMH basically works for metadata harvesting, but doesn't really let you get directly at the data. The specification is too general and flexible: often you can tell what sort of media is associated with the metadata, but not in a machine-readable way.
  2. More specific than this are RSS and Atom. The format is stricter, and data is now bundled as enclosures; the Yahoo Media RSS module adds support for multiple enclosures (a concrete sketch follows this list).
  3. However, even RSS/Atom/Yahoo Media isn't quite enough. It's nearly enough, but not quite: there are some loose ends that need to be tied up if you really want to do 'feed depositing', or (as we would say in Steeple) push your data to a web portal, iTunesU, etc.
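
As an illustration of point 2, here is a minimal sketch (using Python's standard xml.etree.ElementTree) of an RSS item that carries several renditions of the same video via the Yahoo Media RSS namespace; the titles and URLs are made up for the example.

    import xml.etree.ElementTree as ET

    MEDIA_NS = "http://search.yahoo.com/mrss/"   # the Yahoo Media RSS namespace
    ET.register_namespace("media", MEDIA_NS)

    def media_item(title, page_url, renditions):
        # Build one RSS <item> carrying several renditions of one video.
        # 'renditions' is a list of (url, mime_type) pairs; a <media:group>
        # is how Media RSS expresses multiple versions of one resource.
        item = ET.Element("item")
        ET.SubElement(item, "title").text = title
        ET.SubElement(item, "link").text = page_url
        group = ET.SubElement(item, "{%s}group" % MEDIA_NS)
        for url, mime in renditions:
            ET.SubElement(group, "{%s}content" % MEDIA_NS, url=url, type=mime)
        return item

    # Hypothetical example data, not from any real feed:
    item = media_item("Sample lecture", "http://example.org/lecture1",
                      [("http://example.org/lecture1.mp4", "video/mp4"),
                       ("http://example.org/lecture1.ogv", "video/ogg")])
    print(ET.tostring(item).decode())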

We've tried to do this through the Steeple syndication format. This is 95% Atom/RSS/Yahoo Media, but it clarifies the use of those formats and makes a few minor additions (http://purl.org/steeple), which we are hoping to feed back for inclusion in the Yahoo Media standard.

We should add that currently we don't create copies of media: for video, it's useful to do the heavy lifting on a dedicated server, and for the front end just to serve HTML. However, all the information is there, so the data could be cached or even transformed. Currently we only do this for images, because users often provide very large images alongside their feeds that are not suitable for embedding in HTML. The same approach could be taken for audio/video or other files: there are no issues with caching these programmatically. (On a tangent, there are certain advantages to doing this, such as the ability to create more accessible formats.)
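
As a sketch of how straightforward such programmatic caching can be, here is a minimal stand-alone example (plain Python standard library; the cache directory name is made up):

    import hashlib, os, urllib.request

    CACHE_DIR = "cache"   # hypothetical local cache directory

    def cache_enclosure(url):
        # Fetch an enclosure once and keep a local copy, keyed by a hash
        # of its URL. A real portal would also use the feed's versioning
        # information (see below) to decide when a cached copy is stale.
        os.makedirs(CACHE_DIR, exist_ok=True)
        path = os.path.join(CACHE_DIR, hashlib.sha1(url.encode()).hexdigest())
        if not os.path.exists(path):
            urllib.request.urlretrieve(url, path)
        return path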

Because we are more interested in the syndication/web portal angle, we see 'caching' as transient. However, if you were running a repository, you could simply 'cache indefinitely', i.e. deposit. The feed formats certainly provide for versioning, so that you know when you need to update the media in your repository. It's interesting to note that this mirrors the approaches taken by YouTube and iTunesU. In Europe, iTunesU only caches transiently, to help with the load on institutional servers. In the US, iTunesU started by hosting media (in repository fashion), which is also the approach taken by YouTube. To the user this makes little difference: as far as they are concerned, the media is on YouTube or iTunesU, and can be consumed from there (without needing to follow a link to another site).
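
A minimal sketch of how a consumer might use that versioning information, assuming (as feedparser exposes it) that the publisher updates an entry's 'updated' timestamp when the media changes:

    import calendar, os
    import feedparser  # 'entry' below is one element of feedparser.parse(url).entries

    def needs_refresh(entry, cached_path):
        # Compare the cached copy's modification time with the entry's
        # 'updated' timestamp (a UTC struct_time as feedparser exposes it).
        if not os.path.exists(cached_path):
            return True
        updated = entry.get("updated_parsed")
        if updated is None:
            return False   # no version information; keep the cached copy
        return calendar.timegm(updated) > os.path.getmtime(cached_path)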

4 A demonstrator

Over the years, I've built various demonstrators that illustrate these principles, including ScienceLive (bringing science video into one site) and CamTV (bringing together all video from Cambridge into one site). These early demonstrators were also used to build the first Steeple portal: http://podcast.steeple.org.uk

More recently, we've put together a new demonstrator site, illustrating our approach and allowing verification of our feed format: http://www.opencontent.org.uk

Bringing content together like this is important for two reasons:

  1. The site is useful in itself: in certain circumstances users may find it easier to browse media brought together in one place like this.
  2. The site motivates agreement on (details of) feed formats, allowing other interesting things to happen.

At the moment the aggregator primarily brings together video from Steeple partners, but we've also put in some open content, just to see how it goes. In my view this approach scales to more general open content as well, and is important, e.g. for the use of open content in the context of international development (see forthcoming paper in IEEE special issue ...)


2009-11-02