Blog/20131114 Mediawiki offline

1 What process would you recommend to create a static offline version of a mediawiki?

What automated process or script would you recommend to create a static offline version of a mediawiki? (Perhaps both with and without Parsoid?) (See also here [1].)

I've been looking for a good solution for ages, and have experimented with a few things. Here's what we currently do. It's not perfect, and really a bit too cumbersome, but it works as a proof of concept.

1.1 How it works

To illustrate, here is one of our wiki pages:

http://orbit.educ.cam.ac.uk/wiki/OER4Schools/What_is_interactive_teaching

We have a "mirror" script that uses the API to generate an HTML version of a wiki page (which is then 'wrapped' in a basic menu):

http://orbit.educ.cam.ac.uk/orbit_mirror/index.php?page=OER4Schools/What_is_interactive_teaching

(Some log info is printed at the bottom of the page, which provides some hints as to what is going on.)

The resulting page is as low-bandwidth as possible (which is one of our use cases). The original idea with the mirror PHP script was that you could run it on your own server: it only requests pages if they have changed, and it keeps a cache, which allows pages to be viewed even if your server has no connectivity. (You could of course use an ordinary cache anyway; there are advantages and disadvantages compared to this more explicit caching method.) The script rewrites URLs so that normal page links stay within the mirror, while links for editing and history point back at the wiki (see the tabs along the top of the page).
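To make this concrete, here is a minimal Python sketch of the core step (not our actual mirror script): ask the MediaWiki API to render a page to HTML, cache the result on disk, and do a crude link rewrite. The API endpoint, cache path, and the rewrite string are illustrative assumptions.

```python
# Minimal sketch (not the actual orbit_mirror script): fetch the rendered HTML
# of a wiki page via the MediaWiki API and cache it to disk.
import json
import pathlib
import urllib.parse
import urllib.request

API_URL = "http://orbit.educ.cam.ac.uk/w/api.php"   # assumed API endpoint
CACHE_DIR = pathlib.Path("cache")                   # hypothetical cache location

def fetch_page_html(title: str) -> str:
    """Return the parsed HTML of a wiki page, using a simple on-disk cache."""
    CACHE_DIR.mkdir(exist_ok=True)
    cache_file = CACHE_DIR / (urllib.parse.quote(title, safe="") + ".html")
    if cache_file.exists():
        return cache_file.read_text(encoding="utf-8")

    params = urllib.parse.urlencode({
        "action": "parse",        # ask the API to render the page to HTML
        "page": title,
        "prop": "text",
        "format": "json",
    })
    with urllib.request.urlopen(f"{API_URL}?{params}") as resp:
        data = json.load(resp)
    html = data["parse"]["text"]["*"]

    # Keep normal page links inside the mirror; edit/history links are left
    # pointing back at the live wiki (crude string rewrite, illustration only).
    html = html.replace('href="/wiki/', 'href="index.php?page=')

    cache_file.write_text(html, encoding="utf-8")
    return html

if __name__ == "__main__":
    print(fetch_page_html("OER4Schools/What_is_interactive_teaching")[:500])
```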

The mirror script also produces (and caches) a static web page, see here:

http://orbit.educ.cam.ac.uk/orbit_mirror/site/OER4Schools%252FHow_to_run_workshops.html

Assuming you've run wget across the whole mirror, the site will be completely mirrored in '/site'. You can then tar up '/site' and distribute it alongside your w/images directory to get a static copy, or use rsync to incrementally update '/site' and w/images on another server.
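As a rough sketch of that packaging workflow (assuming wget, tar, and rsync are available; the remote host name and paths are purely illustrative):

```python
# Hedged sketch of the packaging workflow described above: crawl the mirror
# with wget, then bundle or sync the static output (paths/URLs illustrative).
import subprocess

MIRROR_URL = "http://orbit.educ.cam.ac.uk/orbit_mirror/"   # mirror front end
SITE_DIR = "site"                                          # static output dir

# 1. Crawl the mirror so every page gets generated and cached as static HTML.
subprocess.run(["wget", "--mirror", "--no-parent", "--convert-links",
                "--no-host-directories", "--directory-prefix", SITE_DIR,
                MIRROR_URL], check=True)

# 2. Tar up the static site for distribution alongside w/images ...
subprocess.run(["tar", "czf", "site.tar.gz", SITE_DIR], check=True)

# 3. ... or push incremental updates of '/site' and w/images to another server
#    ("offline-host" and the destination paths are hypothetical).
subprocess.run(["rsync", "-av", SITE_DIR + "/", "offline-host:/var/www/site/"],
               check=True)
subprocess.run(["rsync", "-av", "w/images/", "offline-host:/var/www/w/images/"],
               check=True)
```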

There's also an API-based process that can work out which pages have changed, and it refreshes the mirror accordingly. (One of the problems with this is to do with transclusions: the timestamp of a page only changes if the page itself is edited, but the HTML for the page can also change if a transcluded template changes. So it's not as straightforward as it might look!)

Edit: Or so I thought! At a discussion at the WMUK meetup last Saturday, Magnus Manske pointed out the "touched" property to me, see e.g. this API query! There are still some unresolved issues, but it's a good step forward! Bjoern 21:48, 19 May 2014 (BST)
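A sketch of what a 'touched'-based refresh could look like, using the standard prop=info query (the API endpoint and cache layout are assumptions, and comparing against the cached file's mtime is just one possible strategy):

```python
# Sketch of the 'touched' idea mentioned above: compare each page's 'touched'
# timestamp against the cached copy, and refresh only pages that look stale.
import json
import os
import urllib.parse
import urllib.request
from datetime import datetime, timezone

API_URL = "http://orbit.educ.cam.ac.uk/w/api.php"   # assumed API endpoint

def touched(title: str) -> datetime:
    """Return the page's 'touched' timestamp from a prop=info query."""
    params = urllib.parse.urlencode({
        "action": "query", "titles": title,
        "prop": "info", "format": "json",
    })
    with urllib.request.urlopen(f"{API_URL}?{params}") as resp:
        pages = json.load(resp)["query"]["pages"]
    page = next(iter(pages.values()))
    return datetime.strptime(page["touched"], "%Y-%m-%dT%H:%M:%SZ").replace(
        tzinfo=timezone.utc)

def needs_refresh(title: str, cache_file: str) -> bool:
    """True if the cached HTML is older than the page's 'touched' time."""
    if not os.path.exists(cache_file):
        return True
    cached = datetime.fromtimestamp(os.path.getmtime(cache_file), tz=timezone.utc)
    return touched(title) > cached

print(needs_refresh("OER4Schools/What_is_interactive_teaching",
                    "cache/OER4Schools%2FWhat_is_interactive_teaching.html"))
```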

1.2 Build into mediawiki?

Most of what I am using is in the mediawiki software already (i.e. API->HTML), and it would be great to have a solution like this that could generate an offline site on the fly. Perhaps one could add another export format to the API, and an extension could then generate the offline site and keep it up to date as pages on the main wiki change. Does this make sense? Would anybody be up for collaborating on implementing this? Are there better things in the pipeline?

I can see why you perhaps wouldn't want it for one of the major wikimedia sites, or why it might be inefficient somehow. But for our use cases, for a small-ish wiki, with a set of poorly connected users across the digital divide, it would be fantastic.

So - what are your solutions for creating a static offline copy of a mediawiki?

2 Addendum

If you're new to mediawiki, and happen to be reading this: I think mediawiki is an excellent platform for (public) content. Public here means: registered users can edit, the world can read. It's not a CMS/LMS in the sense that it doesn't (easily) allow content to be made available only to certain user groups. But if you are happy to have your content available publicly, then it's excellent. Here is why.

  • It's a massively used platform, because of wikipedia. So people may well have experience.
  • There's a mobile front end built in (e.g. http://en.m.wikipedia.org), that automatically reformats content for mobile.
  • There are export facilities to PDF (sets of wiki pages can be exported to PDF).
  • It's possible to take the site offline (there isn't a fully official process, but there is an API; more work is needed, but it's possible in relatively sensible ways), and Zim/Kiwix provides a searchable interface both on desktop and server.
  • There's a visual editor rolling out, allowing both wiki markup and "WYSIWYG" editing. For new users (possibly people who don't yet have great digital literacy!) this is really important. I believe that this was prioritised as a feature after usability research on wikipedia (or something like that).
  • There's a collaborative editor (a la etherpad) in the making, that allows real-time collaboration on wiki pages.
  • There are mobile apps for Wikipedia and beta apps for mediawiki (via Kiwix).

There are drawbacks of course: some may not like the use of PHP, and overall the code has grown organically over the years, and thus has various historical idiosyncrasies which, with hindsight, could have been done differently. Development is also possibly slanted towards mass use (like wikipedia), which in principle is good for performance, but it also means that features really useful for smaller sites are sometimes side-lined because they wouldn't work en masse (if you see what I mean).

So while these technical things may deter some from the mediawiki platform from a software development perspective, I would say that those concerns are nowhere near strong enough to rule out mediawiki, because of the user-centred advantages above. I know that some consider other software/technologies/wikis to be neater, but, from a user-centred perspective, the above features aren't bloat or "nice to have"; they form an ecosystem promoting high accessibility and usability.

3 Another addendum

I just ran wget across our site to refresh the mirror and offline copy. This took three hours (running from the same server that also hosts the wiki), and wget reports "downloaded: 13204 files, 119M in 1.1s (108 MB/s)", which included 'w/images'. Although it downloaded some images, this doesn't include everything. The breakdown of our site is as follows:

  • The wiki pages (as static html), 37M
  • Assets uploaded to mediawiki (stored in "w/images"), 1.2G
  • Videos (external to mediawiki): about 6GB (via our 'download manager', or YouTube) in higher resolution, and about 3.2GB in normal resolution (itag=18, see this blog post). As mobile video, it's less than 1GB, but the quality is quite poor.

4 Another another addendum

One issue is that when putting the offline wiki onto FAT32 memory sticks, the fact that FAT32 is not case-sensitive can become a problem.

Edit: One idea here (suggested by Tom D at Aptivate) could be to use NTFS instead. It's case-sensitive, and supported for reading on Windows (obviously), OS X, and Linux. Under OS X and Linux it would be harder for end-users to write to the stick, but that's not essential. Bjoern 22:40, 20 May 2014 (BST)

Edit: I've given this a go for the new version of our http://www.oer4schools.org offline stick. For writing, on OS X / Linux, NTFS is not fast, so it's a pain to make the stick. (Write speeds of about 4-8Mb/s, i.e. slower than USB-1.) We're duplicating them on Windows to avoid this. The other thing that occurred to me is that people may well copy the stick onto another stick, which would then be FAT, and hence the case-sensitivity issue arises again. It's not ideal, but in the absence of other bright ideas for making filenames unique in lower case, we'll add a note about this. So perhaps NTFS is still preferable overall. Bjoern
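For what it's worth, here is a small sketch of a pre-flight check along those lines: scan the static site for filenames that would collide on a case-insensitive filesystem before copying it onto a FAT32 stick (the directory name is an assumption).

```python
# Sketch of a pre-flight check: find files whose paths differ only in case,
# since these would clobber each other on FAT32.
import collections
import os

def case_collisions(root: str) -> list[list[str]]:
    """Group files whose paths differ only in case."""
    groups = collections.defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            groups[path.lower()].append(path)
    return [paths for paths in groups.values() if len(paths) > 1]

for clash in case_collisions("site"):   # "site" is the assumed static-site dir
    print("Would collide on FAT32:", clash)
```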


2013-11-14