Mediawiki mirror

From Bjoern Hassler

Broader background for this is available on Access2OER.

For a more technical discussion of mediawiki mirroring and synchronisation see Mediawiki mirroring and synchronisation.

1 Idea

The purpose of this project is to create a copy of a MediaWiki site (such as Wikipedia) that can be used on a local area network, so that (e.g.) students at an African university can access the content of the global site at high speed without consuming international bandwidth.

The majority of users (as for the live copy of wikipedia) will want to just read. So it's ok to just have local read access, while for editing you need to go to the real wikipedia. (By 'ok', Bjoern means that it's a fair balance between development effort involved and use cases. To do full read/write mirroring of a wiki is a hard problem.)


2 Why mirror a mediawiki?

Many important OERs are presented as a mediawiki site, including wikieducator.

Suppose you have developed resources on WikiEducator to use with your own students. What happens if your connection goes down for a week or a month? Of course there is the PDF printing service (which is great!), but what if you didn't print, or you want your students to access the material online, and your international bandwidth doesn't support it?

Here is a software demonstrator: http://www.ict4e.net/mirror/wikieducator/index.php

The code uses the MediaWiki API to render pages, which are then cached (together with version information). A cached page is updated if the corresponding WikiEducator page is newer. Pages are smaller (due to the lack of JavaScript and CSS), and (if this were running on my intranet) I could also access all previously visited pages locally, even without (national) internet connectivity.
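
The caching rule described above can be sketched as follows. This is a minimal illustration in Python, not the demonstrator's actual PHP code: the class name MirrorCache, the two callables passed to it, and the in-memory dictionary (standing in for an on-disk cache) are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass
class CachedPage:
    revid: int   # revision id of the cached copy
    html: str    # rendered HTML as returned by the remote wiki


class MirrorCache:
    """Serve pages locally; re-fetch only when the remote revision is newer."""

    def __init__(self,
                 latest_revid: Callable[[str], int],
                 render: Callable[[str], Tuple[int, str]]):
        # Both callables are hypothetical hooks that would wrap MediaWiki API
        # requests: a cheap revision lookup and a full page render.
        self._latest_revid = latest_revid
        self._render = render
        self._pages: Dict[str, CachedPage] = {}

    def get(self, title: str) -> str:
        cached = self._pages.get(title)
        try:
            remote_rev = self._latest_revid(title)
        except OSError:
            # No connectivity (urllib's URLError is an OSError):
            # fall back to the cached copy if there is one.
            if cached is not None:
                return cached.html
            raise
        if cached is None or remote_rev > cached.revid:
            # First visit, or the remote page is newer: render and store it.
            revid, html = self._render(title)
            cached = CachedPage(revid, html)
            self._pages[title] = cached
        return cached.html
```

The revision lookup is kept separate from the render so that an unchanged page costs only one small request, and a previously visited page can still be served when the connection is down.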

For example, here are two pages (measured with YSlow), one with a number of images, the other without significant images:

Page                                             mirror without images   mirror with images   original page
Main Page                                        19k [1]                 48k [2]              119k
WikiEducator:First_Community_Council_Elections   8k [3]                  10k [4]              89k

So the saving is at least 50%, and if you turn images off, a page might shrink by a factor of 10 or more.
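
The figures behind this estimate can be checked with a few lines of Python. The sizes are copied from the measurements above; the function names are only illustrative.

```python
# Page sizes in kilobytes, copied from the measurements above.
SIZES_KB = {
    "Main Page": {"no_images": 19, "with_images": 48, "original": 119},
    "WikiEducator:First_Community_Council_Elections":
        {"no_images": 8, "with_images": 10, "original": 89},
}


def saving_percent(mirror_kb: float, original_kb: float) -> float:
    """Share of the original transfer that the mirror avoids, in percent."""
    return 100.0 * (1.0 - mirror_kb / original_kb)


def reduction_factor(mirror_kb: float, original_kb: float) -> float:
    """How many times smaller the mirrored page is."""
    return original_kb / mirror_kb


for page, kb in SIZES_KB.items():
    print(page)
    for variant in ("with_images", "no_images"):
        print("  %-12s saving %.0f%%, factor %.1f" % (
            variant,
            saving_percent(kb[variant], kb["original"]),
            reduction_factor(kb[variant], kb["original"]),
        ))
```

For these two pages the saving ranges from about 60% (Main Page with images) to about 91% (the elections page without images, a factor of roughly 11).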

In Bjoern's view, this may well be a 'sweet spot' application: very little code, very easy to install (if you have PHP), but it gets you a good set of functionality (transferring data only as needed). Of course a few more things would need to be done, such as caching of images, and some page names with special characters don't work yet.

3 A basic implementation for read-only access: Offline mirror

If basic (perhaps intermittent) online access is available, and the main aim is to reduce bandwidth and make access more stable, the mediawiki api allows local caching.

Example script here:

(See also http://www.wikieducator.org/Access2OER.) We are also using this script on the RECOUP manual

and on ICT4E itself:

In the first two examples, the mirror scripts are run on a local server, while the main sites reside elsewhere.

For the last example, the ICT4E wiki and the mirroring script happen to be on the same server. This is still useful, as it reduces bandwidth. Even when browsing with Opera Mini (see mobile), the mirrored content is much faster.

3.1 Implementation details

Some additional /details are available here.

3.2 MediaWiki-API

For more details, see MediaWiki-API (content moved there). The offline mirror described above uses PHP/curl to retrieve pages. However, one could rewrite it in Perl using MediaWiki::API, which would make it more flexible, because that module already takes care of making requests and parsing results. (Otherwise one would have to create PHP bindings for the API from scratch.)
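
To give an idea of what "making requests and parsing results" involves, here is a minimal Python sketch using only the standard library. It is an illustration, not the mirror script itself: the function names and the api_base argument are assumptions, and the response layout assumed is the API's default (legacy) JSON format, in which the rendered HTML sits under parse.text["*"].

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def parse_url(api_base: str, title: str) -> str:
    """Build an action=parse request URL for one page."""
    query = urlencode({"action": "parse", "page": title,
                       "prop": "text|revid", "format": "json"})
    return api_base + "?" + query


def extract_page(response_body: str) -> tuple:
    """Return (revid, html) from the JSON body of an action=parse response."""
    data = json.loads(response_body)
    if "error" in data:
        # The API reports problems (e.g. a missing page) in an "error" object.
        raise ValueError(data["error"].get("info", "API error"))
    parsed = data["parse"]
    return parsed["revid"], parsed["text"]["*"]


def fetch_page(api_base: str, title: str) -> tuple:
    """Request one rendered page; api_base is the wiki's api.php URL."""
    with urlopen(parse_url(api_base, title), timeout=30) as resp:
        return extract_page(resp.read().decode("utf-8"))
```

A call would look like fetch_page("http://example.org/api.php", "Main Page"), where the URL is a placeholder for the api.php location of the wiki being mirrored.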

3.3 Mediawiki/OAI mirror

Another approach is to use the OAI extension.

See Mediawiki/OAI mirror

3.4 OpenZIM

See OpenZIM

3.5 Use the mysql dump

http://www.openzim.org/Wiki2html

4 Manually copy content every so often (Copy, branch)

See MWM for how to move content between wikis. This just creates a copy, and if you edit the copy, you're branching. If the copy is read-only, you can keep editing on the main site and update your local copy every so often. For instance, you could move the pages from one wiki into a read-only namespace on another wiki, with a note that the content can be edited on the main wiki.

Issues:

  • It's hard to do incremental updates: a wiki cannot export just the incremental changes, so you have to take whole pages. That's ok for one page in an environment with good connectivity, but it's not good for many pages in an environment with low connectivity.
  • Updates don't cover images.

5 Read-write access + Synchronisation

To store an offline copy of a wiki, one should consider whether the synchronisation with the online wiki needs to be automatic or can be done manually.

  • A global, self-synchronising mirror. This has the issue of potentially diverging page histories, and is a hard problem.
  • A CVS-like system that synchronises on demand, with user-based resolution of conflicts.

The latter might go quite a long way, and is certainly a prerequisite for fully automatic synchronisation (with page branching if necessary).
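
The decision such an on-demand system has to make for each page can be sketched as below. This is only an illustration of the idea in Python, under the assumption that the local copy remembers which remote revision it was last synchronised from; it does not describe how mvs or any other existing tool works, and all names are hypothetical.

```python
from enum import Enum


class Action(Enum):
    NOTHING = "nothing to do"
    PULL = "update local copy from remote"
    PUSH = "send local edits to remote"
    CONFLICT = "ask the user to merge"


def sync_action(base_rev: int, remote_rev: int, local_modified: bool) -> Action:
    """Decide what an on-demand sync should do for one page.

    base_rev       -- remote revision the local copy was last synced from
    remote_rev     -- current revision on the remote wiki
    local_modified -- whether the local copy was edited since base_rev
    """
    remote_modified = remote_rev != base_rev
    if remote_modified and local_modified:
        # Both sides changed: the histories have diverged, so a person decides.
        return Action.CONFLICT
    if remote_modified:
        return Action.PULL
    if local_modified:
        return Action.PUSH
    return Action.NOTHING
```

Only the CONFLICT case needs a human. A fully automatic mirror would have to resolve that case itself, which is why it is the harder problem.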

5.1 A CVS-like interface for mediawiki: mvs

For more details, see mvs. (content moved there)

5.2 Global, self-synchronising mirrors: Merging branches

The main issue with having a set of global self-synchronising mirrors is merging branches.

Some references to doing version management within a wiki here:

This talk will present a new wiki-based research project from the MIT Media Laboratory designed to allow collaboration between partially diverged documents or articles. The system is based on ideas and code from software-based distributed revision control systems but provides a text-based and wiki-like interface of branch tracking and accounting, history sensitive merging, and conflict presentation and resolution. The system has important applications for collaboration between forks of Wikimedia projects (e.g., the original proposal for Citizendium), collaboration between branches of articles within Wikis during major revisions of articles, and conflict resolution during normal editing or extended offline work.

5.3 Peer-to-peer wiki

6 Specific to wikipedia

6.1 Legal questions

Legally speaking, is this allowed?

You should note this about wikipedia: "Some mirrors load a page from the Wikimedia servers directly every time someone requests a page from them. They alter the text in some way, such as framing it with ads, then send it on to the reader. This is called remote loading, and it is an unacceptable use of Wikimedia server resources.", see [5] and [6].

The above code only reloads the page if it has changed on the server, but it might still come under 'remote loading'. However, it should be ok if it was on the local network. The above quote refers to issues due to search engines: For every mirror, the load due to search engines on the main wikipedia site doubles.

You might be able to reconcile this as follows: you could stick the present application onto your public network, but use (e.g.) an Apache .htaccess file to redirect based on IP address: Internal people see the local version (that is remote loaded, but not public), others get redirected to the main wikipedia site.
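
The routing rule behind this suggestion can be sketched in Python as follows. It illustrates the logic only and is not an Apache configuration; the network ranges, the function names and the target URL prefix are assumptions chosen for the example.

```python
import ipaddress
from urllib.parse import quote

# Assumed values for illustration: adjust to the networks of your institution.
INTERNAL_NETWORKS = [ipaddress.ip_network("10.0.0.0/8"),
                     ipaddress.ip_network("192.168.0.0/16")]
LIVE_SITE = "https://en.wikipedia.org/wiki/"


def is_internal(client_ip: str) -> bool:
    """True if the client address lies in one of the internal networks."""
    addr = ipaddress.ip_address(client_ip)
    return any(addr in net for net in INTERNAL_NETWORKS)


def route(client_ip: str, title: str) -> tuple:
    """Internal clients get the local mirror; everyone else is redirected."""
    if is_internal(client_ip):
        return ("serve_local", title)
    # Wiki URLs use underscores for spaces; keep "/" and ":" readable.
    return ("redirect", LIVE_SITE + quote(title.replace(" ", "_"), safe="/:"))
```

With this rule, search engines and other outside visitors never cause the mirror to load pages from the Wikimedia servers.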

6.2 Technical questions

The easiest way to mirror wikipedia might be the following:

  • Install mediawiki
  • Get the database dump
  • Install the database dump locally
  • Modify links to do with users and editing
    • Modify the 'edit' link on article pages to point to the live wikipedia site (so that people don't edit your local copy)
    • Modify the 'log in' link to point to the live wikipedia site
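
The link targets for the last step could be built as in the following Python sketch. How the links are actually changed in a MediaWiki installation (for instance in the skin) is not covered here; the function names are hypothetical, and the English Wikipedia is assumed as the live site.

```python
from urllib.parse import urlencode

# Assumed live site for illustration.
LIVE_INDEX = "https://en.wikipedia.org/w/index.php"


def live_edit_url(title: str) -> str:
    """URL of the edit form for this page on the live site."""
    params = {"title": title.replace(" ", "_"), "action": "edit"}
    return LIVE_INDEX + "?" + urlencode(params)


def live_login_url(return_to: str) -> str:
    """URL of the live login page, returning to the given page afterwards."""
    params = {"title": "Special:UserLogin",
              "returnto": return_to.replace(" ", "_")}
    return LIVE_INDEX + "?" + urlencode(params)
```

For example, live_edit_url("Main Page") gives https://en.wikipedia.org/w/index.php?title=Main_Page&action=edit.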


6.3 Tutorials

http://en.wikipedia.org/wiki/Special:Version

http://modzer0.cs.uaf.edu/~dev2c/wiki/How_to_mirror_Wikipedia

6.4 See also

http://tinderblog.wordpress.com/2008/11/21/offline-wikipedia/

Also see MWM for moving data around between wikis.