MWM/Getting SpecialExport data with LWP

From Bjoern Hassler's website

This fetches Special:Export pages with perl/LWP.

1 Working out which pages to get

If you already have a list of pages, skip ahead to 'Getting the pages'.

1.1 By Category

The UNESCO OER wiki runs an outdated MediaWiki install, so the pages in a category cannot be retrieved via the API and have to be supplied manually. One option is to use perl/mechanize to click the 'Add Pages in Category' button on Special:Export, and then to retrieve the resulting list. I might add a recipe for this.
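A minimal sketch of the mechanize approach, not a tested recipe. It assumes the Special:Export form uses the field names 'catname' (category box), 'addcat' (the add button) and 'pages' (the page list); these are assumptions, so check them against the HTML of your wiki's export form. The helper names and the command-line interface are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Split the text of the export form's page box into a clean list of titles.
sub titles_from_text {
    my ($text) = @_;
    $text = '' unless defined $text;
    my @titles;
    for my $line ( split /\r?\n/, $text ) {
        $line =~ s/^\s+|\s+$//g;              # trim surrounding whitespace
        push @titles, $line if length $line;  # skip empty lines
    }
    return @titles;
}

# Ask Special:Export to expand a category, then read the page box back.
# ASSUMPTION: the form fields are named 'catname', 'addcat' and 'pages'.
sub pages_in_category {
    my ( $export_url, $category ) = @_;
    require WWW::Mechanize;    # loaded only when the wiki is actually contacted
    my $mech = WWW::Mechanize->new( autocheck => 1 );
    $mech->get($export_url);
    $mech->submit_form(
        with_fields => { catname => $category },
        button      => 'addcat',
    );
    my $form = $mech->form_with_fields('pages')
        or die "No form with a 'pages' field in the response\n";
    return titles_from_text( $form->value('pages') );
}

# Usage: perl category_pages.pl <Special:Export URL> <category name>
if ( @ARGV == 2 ) {
    print "$_\n" for pages_in_category(@ARGV);
}
```

The script prints one title per line, which is the input format expected by the export script under 'Getting the pages'.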

With more recent installs of MediaWiki, a list of pages within a category can be determined via the API:

api.php?action=query&list=categorymembers&cmtitle=Category:Access2OER

This is easily accomplished using MediaWiki::API:

use MediaWiki::API;

my $mw = MediaWiki::API->new();
$mw->{config}->{api_url} = 'http://.../api.php';

$mw->{config}->{on_error} = \&on_error;

sub on_error {
    print "Error code: " . $mw->{error}->{code} . "\n";
    print $mw->{error}->{stacktrace}."\n";
    die;
};

# get a list of articles in category
my $articles = $mw->list ( {
    action => 'query',
    list => 'categorymembers',
    cmtitle => 'Category:Access2OER',
    cmlimit => 'max' } )
    || die $mw->{error}->{code} . ': ' . $mw->{error}->{details};

# and print the article titles
foreach (@{$articles}) {
    print "$_->{title}\n";
}

1.2 Page with dependencies

You can determine the templates in use on a particular page as follows:

api.php?action=query&prop=templates&titles=Main%20Page
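The same query can be sketched with MediaWiki::API. This is an illustration rather than a finished tool: the helper name template_titles and the command-line arguments are made up here, and a single api() call is assumed to be enough (a page with more templates than the API limit would need continuation handling).

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Collect the template titles from a decoded prop=templates API response.
# Results are keyed by page id under query->pages.
sub template_titles {
    my ($result) = @_;
    my %seen;
    for my $page ( values %{ $result->{query}{pages} || {} } ) {
        $seen{ $_->{title} } = 1 for @{ $page->{templates} || [] };
    }
    return sort keys %seen;
}

# Usage: perl page_templates.pl <api.php URL> <page title>
if ( @ARGV == 2 ) {
    my ( $api_url, $title ) = @ARGV;
    require MediaWiki::API;    # loaded only when the wiki is actually contacted
    my $mw = MediaWiki::API->new( { api_url => $api_url } );
    my $result = $mw->api( {
        action  => 'query',
        prop    => 'templates',
        titles  => $title,
        tllimit => 'max',
    } ) or die $mw->{error}->{code} . ': ' . $mw->{error}->{details} . "\n";

    # Print the page itself followed by its templates, one title per line.
    print "$_\n" for $title, template_titles($result);
}
```

Printing the page together with its templates gives a list that can be passed straight to Special:Export.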

1.3 Subpages

It's possible to determine subpages using the API with apprefix. E.g. get all pages starting with 'Tutorials/' (i.e. proper subpages of Tutorials):

api.php?action=query&list=allpages&aplimit=100&apprefix=Tutorials/

This query does not return the 'Tutorials' page itself, so you need to add it to the list separately.
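A short sketch of the subpage lookup with MediaWiki::API, including the step of adding the parent page to the list. The helper name with_parent and the command-line arguments are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Build the title list for a page and its subpages:
# the parent comes first, then each subpage title once.
sub with_parent {
    my ( $parent, $subpages ) = @_;
    my %seen = ( $parent => 1 );
    return ( $parent, grep { !$seen{$_}++ } map { $_->{title} } @{$subpages} );
}

# Usage: perl subpages.pl <api.php URL> <parent page title>
if ( @ARGV == 2 ) {
    my ( $api_url, $parent ) = @ARGV;
    require MediaWiki::API;    # loaded only when the wiki is actually contacted
    my $mw = MediaWiki::API->new( { api_url => $api_url } );

    # list() returns an array ref of page records, each with a 'title' key.
    my $subpages = $mw->list( {
        action   => 'query',
        list     => 'allpages',
        apprefix => "$parent/",
        aplimit  => 'max',
    } ) or die $mw->{error}->{code} . ': ' . $mw->{error}->{details} . "\n";

    print "$_\n" for with_parent( $parent, $subpages );
}
```

Note that apprefix only searches the main namespace unless apnamespace is also given.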

2 Getting the pages

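When you have your list of pages, the following script reads the titles from standard input (one per line), posts them to Special:Export, and saves the result as an XML file: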

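#!/path/to/perl
use strict;
use warnings;
use LWP::UserAgent;
use HTTP::Request::Common;
use POSIX qw(strftime);

my $myurl = "http://oerwiki.iiep-unesco.org/index.php?title=Special:Export";

# Label used in the output file name. Assumed here to be the first
# command-line argument, with a generic fallback.
my $category = @ARGV ? shift @ARGV : 'pages';
$category =~ tr{/}{_};

# Read the page titles, one per line, from STDIN.
my $pages = '';
while (<STDIN>) {
    $pages .= $_;
}

my %formfields = (
    "pages" => $pages,
    "curonly" => "true",
    "action" => "submit",
    "submit" => "Export"
    );

my $ua = LWP::UserAgent->new;

$ua->protocols_allowed( [ 'http'] );
my $page = $ua->request(POST $myurl,\%formfields);

# Timestamp for the file name, without slashes or colons.
my $date = strftime("%Y-%m-%d %H%M%S", localtime);

if ($page->is_success) {
    my $file = "Special Export $category $date.xml";
    open(my $fh, '>', $file) or die "Cannot write '$file': $!\n";
    print $fh $page->content;
    close $fh;
    print "Done.\n";
} else {
    print STDERR $page->status_line, "\n";
}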