MWAPI/Mirroring

From Bjoern Hassler's website

1 Mirroring/replication/synchronisation of wikis

How would you use the Mediawiki API to mirror or synchronise two wikis? The use case is this: I've got an offline wiki (or a mirror somewhere) that I now want to update (rolling forward from a certain revision or certain date).

The general problem is tricky as you would have to merge and resolve conflicts. For general info see here: Mediawiki mirroring and synchronisation.

On this page, we just try to look at using the API to accomplish the first step:

Obtain changes since the last synchronisation date.

That is to say: get all activity on a wiki since a certain date or revision id. This is needed both for updating a read-only copy, as well as for proper synchronisation. I should add that we're looking at this from an international development perspective, where bandwidth often is a scarce resource, and hybrid mirroring/online/offline models are needed (more on Mediawiki access).

While it's possible to do a number of things with the API, unfortunately each of them needs a number of requests. To illustrate, let's consider a number of cases. Suppose we have content in wiki A, and some of that content (up to a certain revision/date) in wiki B. We now want to update wiki B. (As explained above, we assume that wiki B is read-only, i.e. no edits have been made to B.)

2 Download the whole wiki

The simplest approach: every time you want to synchronise, you download the whole of wiki A (including all files/images) and upload it to wiki B. (Effectively this is throwing away wiki B and completely repopulating it with content from A.) This is clearly possible, but in reality it would only work for small wikis, because it leads to potentially huge amounts of data transfer.

However, this nicely illustrates the problem: in principle, getting 'changes since a certain date' is possible, but there don't seem to be optimised ways of doing it.

3 Get recent changes

An API query with list=recentchanges would work, were it not for the fact that recent changes are periodically purged, so it cannot be used reliably on its own. $wgRCMaxAge can be adjusted from the default (one week in older MediaWiki releases, longer in recent ones) to a larger value. Where the changes you are after fall within the range of $wgRCMaxAge you can use this; otherwise you have to fall back on one of the alternative methods below.
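
To make the query concrete, here is a minimal sketch that builds a list=recentchanges request for everything changed since a given timestamp. The base URL is a hypothetical placeholder for wiki A, and the function name is mine; the rc* parameters are the standard ones for this list module. A real client would also have to follow the continuation values in the response until the list is exhausted, which is not shown.

```python
from urllib.parse import urlencode

def recent_changes_url(api_base, since, limit=500):
    """Build a list=recentchanges query for changes made at or after `since`.

    `since` is a MediaWiki timestamp such as "20090521000000".
    """
    params = {
        "action": "query",
        "list": "recentchanges",
        "rcstart": since,   # with rcdir=newer, rcstart is the oldest timestamp
        "rcdir": "newer",   # oldest first, so we can roll forward in order
        "rcprop": "title|ids|timestamp",
        "rclimit": limit,
        "format": "json",
    }
    return api_base + "?" + urlencode(params)

# Hypothetical wiki A; replace with the real api.php location.
url = recent_changes_url("https://wiki-a.example.org/w/api.php", "20090521000000")
print(url)
```

This only helps if the last synchronisation date is still inside the $wgRCMaxAge window on wiki A.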

4 Getting only the most recent version of a page

In many scenarios, we just want the most recent versions of pages that have changed since a certain time. This would be similar to this api query:

api.php?action=query&prop=revisions&generator=allpages&rvstart=20090521000000

which doesn't work, because rvstart can only be used when querying a single page. (Similarly for rvstartid instead of rvstart.)

The alternative is to

  • Fetch the list of namespaces
  • Get the list of revisions in each namespace (api.php?action=query&prop=revisions&generator=allpages for each namespace)
  • See what needs updating, and then fetch all the changed pages.

Note that if we only fetch the most recent revision, we also need to check the log for moved pages. The above isn't amazingly fast: you might need around half a second per namespace (if you query sequentially). So one might want to combine this with the recent changes option, if the date is in range.

As with rolling forward (see the next section), diffs would be useful here. And likewise, once we have the content, we transform it to the Special:Import/Export format and can import it into the new wiki (for one-way synchronisation).
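
The three steps above can be sketched roughly as follows. This is only an illustration under assumptions: the function names are mine, the response is assumed to be in the default JSON layout (a "pages" mapping keyed by page id), and wiki B is assumed to keep a hypothetical title-to-revision-id table of what it already holds. Continuation across batches is left out.

```python
from urllib.parse import urlencode

def namespace_revisions_url(api_base, namespace, limit=50):
    """Query the latest revision id of every page in one namespace."""
    params = {
        "action": "query",
        "generator": "allpages",
        "gapnamespace": namespace,
        "gaplimit": limit,
        "prop": "revisions",
        "rvprop": "ids|timestamp",
        "format": "json",
    }
    return api_base + "?" + urlencode(params)

def pages_needing_update(remote_pages, local_revids):
    """Return titles whose latest revision on wiki A differs from wiki B.

    remote_pages: the "pages" mapping from the query response of wiki A.
    local_revids: hypothetical mapping of title -> revision id held in wiki B.
    """
    stale = []
    for page in remote_pages.values():
        latest = page["revisions"][0]["revid"]
        if local_revids.get(page["title"]) != latest:
            stale.append(page["title"])
    return sorted(stale)

# Made-up sample data in the shape of a query response.
remote = {
    "11": {"title": "Main Page", "revisions": [{"revid": 1460}]},
    "12": {"title": "Help", "revisions": [{"revid": 1200}]},
    "13": {"title": "New page", "revisions": [{"revid": 1470}]},
}
local = {"Main Page": 1449, "Help": 1200}
print(pages_needing_update(remote, local))
```

Pages that exist only in wiki B (deleted or moved on A) are not detected by this comparison, which is one more reason to consult the logs.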

5 Rolling forward from certain revision

Suppose I want to get all revisions (so that not just the most recent versions, but the full history is mirrored). If the offline wiki has most recent revision 1449, I can then roll forward like this:

action=query&prop=revisions&revids=1450|1451|1452|...&rvprop=content

It potentially needs to fetch a lot of pages, but unless we can do a diff (see below) that's unavoidable. Once we have the content, we then transform it to Special:Import/Export format, and can import it into the new wiki (for one way synchronisation).
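
Because the API caps how many revision ids a single request may carry (50 for ordinary users, more for accounts with higher limits), the roll-forward has to be split into batches. A minimal sketch, assuming we already know the newest revision id on wiki A (for instance from recent changes); the function name and base URL are hypothetical:

```python
from urllib.parse import urlencode

def revision_batch_urls(api_base, first_revid, last_revid, batch_size=50):
    """Build one revisions query per batch of consecutive revision ids."""
    urls = []
    for start in range(first_revid, last_revid + 1, batch_size):
        stop = min(start + batch_size - 1, last_revid)
        revids = "|".join(str(r) for r in range(start, stop + 1))
        params = {
            "action": "query",
            "prop": "revisions",
            "revids": revids,
            "rvprop": "ids|timestamp|user|comment|content",
            "format": "json",
        }
        urls.append(api_base + "?" + urlencode(params))
    return urls

# Wiki B ends at revision 1449; suppose wiki A is now at 1560.
batches = revision_batch_urls("https://wiki-a.example.org/w/api.php", 1450, 1560)
print(len(batches))
```

Revision ids that belong to deleted pages will simply be reported as missing, so gaps in the result are to be expected.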

If we fetch all revisions like this, do we still need to look at the logs? Won't any moved pages just show up in the revisions? Moving a page doesn't seem to create a new page: it keeps the revision number and just changes the title. The page we moved from gets a new revision. So the above method doesn't catch pages that were moved, and we need to look at the move log as well.
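
Reading the move log could look something like the sketch below. The query uses list=logevents with letype=move. The part to treat with caution is the parsing: as far as I can tell, the field holding the new title differs between MediaWiki versions (a "move" object with "new_title" in older ones, a "params" object with "target_title" in newer ones), so both are tried here as an assumption that should be checked against a real response. Function names and the base URL are hypothetical.

```python
from urllib.parse import urlencode

def move_log_url(api_base, since, limit=500):
    """Build a logevents query for page moves at or after `since`."""
    params = {
        "action": "query",
        "list": "logevents",
        "letype": "move",
        "lestart": since,
        "ledir": "newer",   # oldest first, so moves can be replayed in order
        "lelimit": limit,
        "format": "json",
    }
    return api_base + "?" + urlencode(params)

def parse_moves(logevents):
    """Return (old_title, new_title) pairs from a list of log events."""
    moves = []
    for event in logevents:
        # Assumed layouts: newer "params"/"target_title", older "move"/"new_title".
        details = event.get("params") or event.get("move") or {}
        new_title = details.get("target_title") or details.get("new_title")
        if new_title:
            moves.append((event["title"], new_title))
    return moves

# Made-up sample events covering both assumed layouts.
events = [
    {"title": "Old page", "params": {"target_ns": 0, "target_title": "New page"}},
    {"title": "Draft", "move": {"new_ns": 0, "new_title": "Final"}},
]
print(parse_moves(events))
```

Replaying the pairs in chronological order matters, since a page may be moved more than once between two synchronisations.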

6 Other considerations

6.1 Images / Files

Images/Files also need to be treated: We need to extract uploads from the log, and fetch those files too.
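
Once the upload log (list=logevents with letype=upload) has given us the titles of new or changed files, prop=imageinfo can tell us where to download them from. A small sketch, again with hypothetical function names and base URL, and assuming the default JSON response layout:

```python
from urllib.parse import urlencode

def file_info_url(api_base, file_titles):
    """Build an imageinfo query for a batch of File: titles."""
    params = {
        "action": "query",
        "titles": "|".join(file_titles),
        "prop": "imageinfo",
        "iiprop": "url|timestamp|sha1",
        "format": "json",
    }
    return api_base + "?" + urlencode(params)

def download_urls(pages):
    """Map each file title to the URL of its current version."""
    urls = {}
    for page in pages.values():
        info = page.get("imageinfo")
        if info:  # missing files have no imageinfo entry
            urls[page["title"]] = info[0]["url"]
    return urls

# Made-up sample in the shape of a query response.
sample_pages = {
    "21": {"title": "File:Map.png",
           "imageinfo": [{"url": "https://wiki-a.example.org/images/Map.png"}]},
    "-1": {"title": "File:Gone.jpg", "missing": ""},
}
print(download_urls(sample_pages))
```

Requesting the sha1 as well would let wiki B skip files it already holds unchanged, which matters when bandwidth is scarce.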

6.2 Diffs

There are reasons why a diff isn't offered (see MW-API mailing lists and bug tracker): Basically it seems to be a computationally intensive procedure.

6.3 Compression

It would also be nice if output was offered in a compressed format - that might just be a question of configuring the web server properly.
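
On the client side, asking for compressed output is just a matter of sending an Accept-Encoding header and decompressing the reply if the server honoured it. A minimal sketch using only the Python standard library; the User-Agent string and function names are made up, and whether the server actually compresses depends on its configuration:

```python
import gzip
from urllib.request import Request

def gzip_request(url):
    """Build a request that tells the server we accept gzip output."""
    return Request(url, headers={
        "Accept-Encoding": "gzip",
        "User-Agent": "wiki-mirror-sketch/0.1",  # hypothetical client name
    })

def decode_body(raw, content_encoding):
    """Decompress the body if the server said it was gzip-encoded."""
    if content_encoding == "gzip":
        raw = gzip.decompress(raw)
    return raw.decode("utf-8")

# Offline demonstration with a locally compressed payload.
payload = gzip.compress(b'{"query": {}}')
print(decode_body(payload, "gzip"))
```

With a live server, the request would be passed to urllib.request.urlopen and the Content-Encoding response header handed to decode_body.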

7 Discussion