Mediawiki/OAI mirror/OAIRepository

From Bjoern Hassler's website
"Mediawiki mirroring and synchronisation"

1 Extension:OAIRepository

This page describes how to install the OAIRepository extension for MediaWiki and how to use it to mirror wikis. The instructions are based on those on the extension page, mediawikiwiki:Extension:OAIRepository, and on its talk page, mediawikiwiki:Extension_talk:OAIRepository.

We'll use this extension to mirror wiki content from a 'server' (the 'source' wiki) to a client (the 'mirror' wiki). The client should be 'read-only' (!): you won't be able to synchronise the two wikis safely. For more, see Mediawiki/OAI mirror.

2 Install OAI extension on source wiki (oai server)

2.1 Get the extension

Get the extension files, and put them in w/extensions/OAI/

Running

svn co http://svn.wikimedia.org/svnroot/mediawiki/trunk/extensions/OAI/

in your extensions directory (probably 'w/extensions' or 'wiki/extensions') should work.

The following MySQL steps are a little long-winded. Here's a Perl script that does all of them in one go: Mediawiki/OAI mirror/OAIRepository/install oai server.pl. Use at your own risk: I haven't tested it extensively at all! If you run the script, you can skip the modifications to LocalSettings.php and MySQL, but you still need to modify a PHP file (OAIRepo_body.php), see below.

2.2 LocalSettings.php

Add the following to LocalSettings.php:

# OAI repository for update server
@include( $IP.'/extensions/OAI/OAIRepo.php' );
// $oaiAgentRegex = '/experimental/';
// $oaiAuth = true; # broken... squid? php config? wtf
// $oaiAudit = true;
$wgDebugLogGroups['oai'] = '/home/wikipedia/logs/oai.log';

The last line needs amending: choose a suitable location for the log file.

2.3 Add OAI tables to the MySQL database

You then need to run some SQL against your database. In extensions/OAI, you've got

update_table.sql
oaiaudit_table.sql 
oaiharvest_table.sql
oaiuser_table.sql

2.3.1 update_table.sql

(1) You need the values of $wgDBprefix and $wgDBname.

  • There may not be a prefix set. Check the value of $wgDBprefix in LocalSettings.php.
  • The standard wiki db is 'wikidb', but it may be called something else. Check the value of $wgDBname in LocalSettings.php.

Typical settings:

$wgDBprefix = "mw_";
$wgDBname = "wikidb";

But in the examples below we assume that

$wgDBprefix = "mwsource_";
$wgDBname = "wikimirrordb";
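
If you'd rather not look these values up by hand, they can be pulled out of the text of LocalSettings.php. The following is only a minimal sketch: read_php_string_setting is a hypothetical helper (not part of MediaWiki or the OAI extension), and it assumes the settings are plain quoted string literals, as in the examples above.

```python
import re


def read_php_string_setting(php_source, name):
    """Return the value of a simple assignment like $name = "value"; or None.

    Hypothetical helper: only plain quoted literals are handled, not
    concatenation, variables or commented-out lines.
    """
    pattern = r'^\s*\$' + re.escape(name) + r'\s*=\s*(["\'])(.*?)\1\s*;'
    match = re.search(pattern, php_source, flags=re.MULTILINE)
    return match.group(2) if match else None


# Sample text standing in for the contents of LocalSettings.php.
sample = '''
$wgDBprefix = "mwsource_";
$wgDBname = "wikimirrordb";
'''

db_prefix = read_php_string_setting(sample, "wgDBprefix")
db_name = read_php_string_setting(sample, "wgDBname")
print(db_prefix, db_name)
```

In practice you would pass in the real file contents, e.g. open("LocalSettings.php").read(). A return value of None means the setting wasn't found in this simple form, in which case check the file by hand.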

(2) Edit update_table.sql to replace /*$wgDBprefix*/ with the actual value of the prefix (determined above). You can use

perl -i.bak -pe 's/\/\*\$wgDBprefix\*\//mwsource_/g' update_table.sql

to make the edit (replacing mwsource_ with your prefix).
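
If Perl isn't to hand, the same substitution can be sketched in Python. This is an illustration only: apply_prefix and rewrite_file are hypothetical helper names, and the SQL in the demo is made up rather than taken from update_table.sql. Like the Perl one-liner, rewrite_file keeps a .bak copy of the original.

```python
import shutil
import tempfile
from pathlib import Path

PLACEHOLDER = "/*$wgDBprefix*/"


def apply_prefix(sql_text, prefix):
    """Replace every /*$wgDBprefix*/ placeholder with the given prefix."""
    return sql_text.replace(PLACEHOLDER, prefix)


def rewrite_file(path, prefix):
    """Rewrite one SQL file in place, keeping the original as <name>.bak."""
    path = Path(path)
    shutil.copy2(path, path.with_name(path.name + ".bak"))
    path.write_text(apply_prefix(path.read_text(), prefix))


# Demo on a throwaway file with made-up SQL (not the real update_table.sql).
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "demo_table.sql"
    demo.write_text("CREATE TABLE /*$wgDBprefix*/demo (id INT);")
    rewrite_file(demo, "mwsource_")
    rewritten = demo.read_text()
    backup_exists = (Path(tmp) / "demo_table.sql.bak").exists()

print(rewritten)
```

For the real files you would call rewrite_file("update_table.sql", "mwsource_") from within extensions/OAI, with your own prefix.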

(3) update_table.sql now needs to be run against the wiki DB. (mediawikiwiki:Extension:OAIRepository notes that this will take a significant amount of time on rather large wikis.) So you run

mysql wikimirrordb -uroot -p < update_table.sql

(where the database name and username may differ depending on your circumstances; you can get the name of the database from $wgDBname as above).

2.3.2 The other three sql files

You now need some tables for the OAI process itself. These can live in any db to which the wiki db user has access. We choose the same db as the wiki db, but mediawikiwiki:Extension:OAIRepository has instructions for using a separate db.

Add the following in LocalSettings.php:

$oaiAuditDatabase = 'wikimirrordb'; 

You then need to create three tables for OAI, which is done by these sql scripts: oaiuser_table.sql, oaiharvest_table.sql, oaiaudit_table.sql.

As before (for update_table.sql), replace /*$wgDBprefix*/ with the actual prefix. You can use

perl -i.bak -pe 's/\/\*\$wgDBprefix\*\//mwsource_/g' oaiuser_table.sql oaiharvest_table.sql oaiaudit_table.sql

to make the edit (where you need to replace mwsource_ with your prefix).

Now create the additional tables:

mysql wikimirrordb -uroot -p < oaiaudit_table.sql
mysql wikimirrordb -uroot -p < oaiharvest_table.sql
mysql wikimirrordb -uroot -p < oaiuser_table.sql

(again: the database name and username may differ depending on your circumstances; you can get the name of the database from $wgDBname as above).
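
Since the same mysql invocation is repeated for each file, the command lines can be generated rather than typed. The sketch below only builds and prints the commands; it doesn't run them. mysql_import_commands is a hypothetical helper, and the database name and user are the ones assumed on this page.

```python
import shlex


def mysql_import_commands(db_name, db_user, sql_files):
    """Return one shell command line per SQL file, in the form used above."""
    return [
        f"mysql {shlex.quote(db_name)} -u{shlex.quote(db_user)} -p < {shlex.quote(name)}"
        for name in sql_files
    ]


commands = mysql_import_commands(
    "wikimirrordb",
    "root",
    ["oaiaudit_table.sql", "oaiharvest_table.sql", "oaiuser_table.sql"],
)
for line in commands:
    print(line)
```

The printed lines can be pasted into a shell; each will prompt for the MySQL password because of -p.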

2.3.3 Adding a login for the oai user

To be able to log in to the OAIRepository, you'll have to add a login to the oaiuser table. The username and password don't need to be the same as $wgDBuser and $wgDBpassword, and because they may be passed in the clear, it's better to use something else.

Create a file called add_user.sql containing

INSERT INTO mwsource_oaiuser(ou_name, ou_password_hash) VALUES ('SomeUserName', md5('SomePassword') );

and amend 'SomeUserName' and 'SomePassword'. Then run

mysql wikimirrordb -uroot -p < add_user.sql
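
As a variation, the MD5 hash can be computed before the SQL is written, so that the plain-text password never ends up in add_user.sql. This is a sketch under assumptions: oai_user_insert is a hypothetical helper, the password is assumed to be hashed as UTF-8 bytes (which gives the same hex digest as MySQL's md5() for ASCII passwords), and the quoting is a simple illustration rather than a general-purpose SQL escaper.

```python
import hashlib


def oai_user_insert(prefix, user, password):
    """Build the INSERT statement for the oaiuser table with a precomputed hash."""
    # Same 32-character lowercase hex digest that md5('...') yields in MySQL.
    password_hash = hashlib.md5(password.encode("utf-8")).hexdigest()
    # Minimal quoting for a MySQL string literal: backslashes, then single quotes.
    safe_user = user.replace("\\", "\\\\").replace("'", "''")
    return (
        f"INSERT INTO {prefix}oaiuser(ou_name, ou_password_hash) "
        f"VALUES ('{safe_user}', '{password_hash}');"
    )


print(oai_user_insert("mwsource_", "SomeUserName", "SomePassword"))
```

The printed statement can be saved as add_user.sql and run with the mysql command above.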

2.4 Edit OAIRepo_body.php

As detailed on the page for the extension, there's an error in OAIRepo_body.php. A bug report has been filed, and you should check whether this has been resolved. If not, you need to make some manual changes.

Basically, you need to insert $wgDBprefix in three places:

$this->auditTableName( $wgDBprefix . 'oaiuser' ),
$this->auditTableName( $wgDBprefix . 'oaiaudit' ),
$this->mAuditDb = $lb->getConnection( DB_MASTER, $wgDBprefix . 'oaiAudit', $oaiAuditDatabase );

and add

       global $wgDBprefix;

to those functions as well.

3 The oai repository on the server

The OAI repository is now installed on the server, and something like http://www.sciencemedianetwork.org/w/index.php?title=Special:OAIRepository&verb=ListMetadataFormats should now work. (You'll need the username and password to authenticate.) You can try these queries:

(This won't work on the present wiki, as the extension isn't installed yet.)
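
Test queries like the one above all follow the same pattern, so the URLs can be built programmatically. The verbs used here (Identify, ListMetadataFormats, ListRecords) and the oai_dc metadata prefix come from the OAI-PMH protocol itself; oai_request_url is a hypothetical helper, and whether a given repository answers each query is something to check rather than assume.

```python
from urllib.parse import urlencode


def oai_request_url(index_php_url, verb, **arguments):
    """Build a Special:OAIRepository request URL for one OAI-PMH verb."""
    params = {"title": "Special:OAIRepository", "verb": verb}
    params.update(arguments)
    # Keep ':' unescaped so the page title reads as in the example URL above.
    return index_php_url + "?" + urlencode(params, safe=":")


base = "http://www.sciencemedianetwork.org/w/index.php"
for verb, args in [
    ("Identify", {}),
    ("ListMetadataFormats", {}),
    ("ListRecords", {"metadataPrefix": "oai_dc"}),
]:
    print(oai_request_url(base, verb, **args))
```

Replace base with the index.php URL of your own source wiki.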

Once you have the repository set up, you can do things with it. For instance, mediawikiwiki:Extension:OAIRepository explains how to set up a lucene search. On these pages, we'll set up another wiki as a client.

4 The wiki mirror (oai client)

The idea is to set up a second wiki that acts as a harvester for the source wiki.

It's not clear how much of the above needs to be repeated for the client wiki, but going through all the steps seems to work.

4.1 Install the extension

... as above.

4.2 Modify LocalSettings.php

... as above.

For testing, I installed both wikis on the same server, in the same database, and thus I used a different $wgDBprefix:

$wgDBprefix = "mwmirror_";
$wgDBname = "wikimirrordb";

4.3 Do the mysql modifications

... as above.

4.4 Edit OAIRepo_body.php

... probably not necessary, as we won't be using the wiki as a repo.

4.5 Finally

Add the following lines to LocalSettings.php to enable the harvester:

@include( $IP.'/extensions/OAI/OAIHarvest.php' );
$oaiSourceRepository = "http://url.to.the.source.wiki/wiki/index.php/Special:OAIRepository";

(where the URL points to the repository created above).

The client wiki is now pointed at the repository created previously, and we can run

php oaiUpdate.php

on the client. This will return a number of messages about pages to be updated. (However, see issues below!)

5 Issues

5.1 Authentication

I couldn't get this to work with authentication. If you comment out these lines on both client and server, it works.

// $oaiAgentRegex = '/experimental/';
// $oaiAuth = true; # broken... squid? php config? wtf
// $oaiAudit = true;

5.2 Files and images

When trying to transfer files between wikis (v.1.15), you get:

File updating temporarily broken on 1.11, sorry!

so file/image transfer seems to be broken, due to the change in image handling from 1.11 onwards (and the removal of wfImageDir).

Here is a hack to fix this. In OAIHarvest.php, below

echo "File updating temporarily broken on 1.11, sorry!\n";

insert

$image = Image::newFromTitle($upload['filename']);
$new_image = preg_replace("/^.*?images\//", "", $upload["src"]);
$new_image = $image->repo->directory . "/" . $new_image;
$filename = $new_image;

This relies on the source wiki using /images/ as the upload directory. A better solution would probably be to make this a configuration setting.
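
To make the path rewriting in the hack easier to follow, here is the same logic sketched in Python, with the upload directory name as a parameter instead of being hard-coded. This is not the extension's code: local_upload_path and the example URL and directories are made up for illustration.

```python
import re


def local_upload_path(src_url, local_upload_dir, upload_dir_name="images"):
    """Map a source file URL to a path under the mirror's upload directory.

    Everything up to and including the first '<upload_dir_name>/' is dropped
    from the URL; the remainder (e.g. 'a/ab/Foo.png') is appended to the
    local upload directory.
    """
    pattern = r"^.*?" + re.escape(upload_dir_name) + r"/"
    relative = re.sub(pattern, "", src_url, count=1)
    # Unlike the PHP hack, a trailing slash on the directory is trimmed first.
    return local_upload_dir.rstrip("/") + "/" + relative


print(local_upload_path("http://source.example/w/images/a/ab/Foo.png", "/var/www/w/images"))
```

As in the PHP hack, a URL that doesn't contain the upload directory name is passed through unchanged and simply appended, so the result should be checked in that case.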

There's also an issue with caching:

  • Add image link to source
  • Run harvester on client
  • Add actual image to source
  • Run harvester again

In that case, the page on the client needs to be purged for the image to show. If the client is low traffic, caching could be turned off to work around this.
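
MediaWiki purges a page when it is requested with action=purge, so one way to handle this is to build purge URLs for the affected pages on the client. The sketch below only constructs the URL; purge_url is a hypothetical helper, the mirror address is made up, and depending on the MediaWiki version and whether you are logged in, the wiki may ask for confirmation before purging.

```python
from urllib.parse import urlencode


def purge_url(index_php_url, page_title):
    """Build the URL that asks MediaWiki to purge the cache for one page."""
    title = page_title.replace(" ", "_")
    return index_php_url + "?" + urlencode({"title": title, "action": "purge"}, safe=":")


print(purge_url("http://mirror.example/w/index.php", "Main Page"))
```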