Blog/20131118 Creating pdf for OER4Schools

From Bjoern Hassler's website

Creating pdf for OER4Schools

We're going to ship some copies of our resource out to Zambia, and so we need to print the resource from the wiki. The collection extension doesn't print our boxes well, so we had been printing each page separately and then collating them. That was a right pain. We need something more automatic.

In the last blog post I talked about how to create an offline copy of our wiki. So let's download all the pages that we want to print as html. We have to do something like this for each 'page':

   wget -q -O /dev/null 'http://orbit.educ.cam.ac.uk/w/index.php?action=purge&title=OER4Schools/page'
   wget -q -O /dev/null 'http://orbit.educ.cam.ac.uk/orbit_mirror/index.php?page=OER4Schools/page&purgepage=yes'
   wget -q -O - http://orbit.educ.cam.ac.uk/orbit_mirror/site/page.html > page.html

This firstly purges the cache on the wiki, then purges the cache on the mirror, and then downloads the page. The resulting oer4schools_complete.html file displays ok in a browser, but with a few replacements, it looks even better. Here are some of the things that I fixed:

  • removed the main menu (which is only useful for web);
  • removed other navigation only needed for web;
  • changed the levels of the headings to generate proper chapter headings;
  • adjusted heading title styles;
  • replaced the video / audio players with messages on how to find the video / audio.
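As an illustration of the heading fix, here is a minimal sketch that shifts every heading down one level with sed. The direction and the amount of the shift are assumptions (the actual replacements used for our pages may have differed), the function name demote_headings is made up for this example, and it only handles lower-case tags.

```shell
# Demote every HTML heading by one level (h1 -> h2, ..., h5 -> h6).
# Reads HTML on stdin and writes the adjusted HTML to stdout.
# The expressions run from h5 down to h1 so that a heading is never
# shifted twice on its way through sed.
demote_headings() {
    sed -e 's|<\(/*\)h5\([ >]\)|<\1h6\2|g' \
        -e 's|<\(/*\)h4\([ >]\)|<\1h5\2|g' \
        -e 's|<\(/*\)h3\([ >]\)|<\1h4\2|g' \
        -e 's|<\(/*\)h2\([ >]\)|<\1h3\2|g' \
        -e 's|<\(/*\)h1\([ >]\)|<\1h2\2|g'
}

# Usage: demote_headings < page.html > page.fixed.html
```

The other fixes (dropping menus, swapping players for messages) depend on the exact markup of the mirror pages, so they are not sketched here.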

(Of course you could also get those pages via the print view of the wiki.)
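To run the three download steps for a whole list of pages, something along these lines might do. It is only a sketch: the function names and the pages.txt file are made up for this example, and it assumes that the 'page' placeholder is filled in the same way in all three URLs.

```shell
WIKI='http://orbit.educ.cam.ac.uk/w/index.php'
MIRROR='http://orbit.educ.cam.ac.uk/orbit_mirror'

# Print the three URLs for one page, one per line:
# wiki purge, mirror purge, and the mirrored html page.
page_urls() {
    printf '%s\n' \
        "$WIKI?action=purge&title=OER4Schools/$1" \
        "$MIRROR/index.php?page=OER4Schools/$1&purgepage=yes" \
        "$MIRROR/site/$1.html"
}

# Purge both caches, then save the mirrored page as <page>.html.
fetch_page() {
    page=$1
    page_urls "$page" | {
        read -r purge_wiki
        read -r purge_mirror
        read -r download
        wget -q -O /dev/null "$purge_wiki" &&
        wget -q -O /dev/null "$purge_mirror" &&
        wget -q -O "$page.html" "$download"
    }
}

# Usage (pages.txt is a hypothetical list, one page name per line):
#   while read -r page; do fetch_page "$page"; done < pages.txt
```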

Now that we have the html, let's generate the pdf from the html with wkpdf. Let's install it:

sudo apt-get install rubygems
sudo gem install wkpdf

and then fetch the pdf files:

wkpdf -s http://orbit.educ.cam.ac.uk/orbit_mirror/page.html -o page.pdf

This results in a nice set of pdf files, one per wiki page. We are getting somewhere! We now need to combine these, but before we do this, we need to think about a table of contents and page numbers.

So let's first see how many pages each page.pdf has. Let's use pdftk:

sudo apt-get install pdftk

and then this gets us the total number of pages (look for the NumberOfPages line in the output):

pdftk page.pdf dump_data
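The page count can be pulled out of that output automatically. A small sketch (the function names are made up for this example), which relies on the NumberOfPages line that pdftk prints as part of dump_data:

```shell
# Read `pdftk ... dump_data` output on stdin and print the page count.
page_count_from_dump() {
    awk '/^NumberOfPages:/ { print $2 }'
}

# Print the number of pages in a pdf file (requires pdftk).
pdf_page_count() {
    pdftk "$1" dump_data | page_count_from_dump
}

# Usage: pdf_page_count page.pdf
```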

So now that we've got the length of each file in pages, we can make a toc: initially as a text file, which we then convert to toc.pdf, e.g. using enscript.
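Working out the toc from the page counts is just a running sum. Here is a rough sketch; the "title|page_count" input format, the make_toc name and the column widths are all assumptions for illustration, not what was actually used.

```shell
# Read "title|page_count" lines on stdin and print one toc line per
# section: the title, then the page on which that section starts.
# $1 is the page number of the first section (default 1).
make_toc() {
    awk -F'|' -v page="${1:-1}" '{
        printf "%-40s %4d\n", $1, page   # title padded to 40 columns
        page += $2                       # next section starts after this one
    }'
}

# Usage (sections.txt is a hypothetical file in the format above):
#   make_toc 1 < sections.txt > toc.txt
#   enscript -B toc.txt --output - | ps2pdf - toc.pdf
```

If the title page and the toc itself are to be counted in the numbering, the starting page passed to make_toc needs to allow for them.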

Now for the page numbers: we need to print the page numbers onto each page.pdf file. One way of doing this is to create pages containing just the numbers, and then superimpose them on the real pages. Again, enscript is often recommended for that, i.e. pipe blank lines (one per page needed) through it:

(one blank line per page in page.pdf) | enscript -L1 --header='$title||$i' --output - | ps2pdf - page.header.pdf

and then combine page.pdf with page.header.pdf like this:

pdftk page.pdf multistamp page.header.pdf output numbered_page.pdf
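Put together, the two steps might look like the sketch below. The function names are made up, and the header text 'Page $%' stands in for the $title and $i placeholders above ($% is enscript's escape for the current page number). Note that this numbers each file from 1; continuous numbering across the combined document would need an offset, which this sketch does not handle.

```shell
# Print N empty lines; with `enscript -L1` each line becomes one page.
blank_lines() {
    n=$1
    while [ "$n" -gt 0 ]; do
        echo
        n=$((n - 1))
    done
}

# Stamp page numbers onto a pdf (requires pdftk, enscript and ps2pdf).
# $1 = input pdf, $2 = output pdf
number_pdf() {
    pages=$(pdftk "$1" dump_data | awk '/^NumberOfPages:/ { print $2 }')
    blank_lines "$pages" |
        enscript -L1 --header='||Page $%' --output - |
        ps2pdf - "$1.header.pdf" &&
    pdftk "$1" multistamp "$1.header.pdf" output "$2"
}

# Usage: number_pdf page.pdf numbered_page.pdf
```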

So now that we have our pages with page numbers, we go ahead and combine these:

pdftk titlepage.pdf toc.pdf numbered_page_1.pdf ... numbered_page_N.pdf backpage.pdf cat output Final_document.pdf

where we've added the toc.pdf we created earlier, as well as a titlepage.pdf and backpage.pdf.

Not the most straightforward process. Some issues were that I couldn't get the margin settings for wkpdf to work, nor the --footer setting for enscript. Also, while wkpdf does a good job at rendering, occasionally a line "bleeds" across two pages, but that could be an issue with the underlying webkit, rather than wkpdf. However, apart from that, it sort of all worked out!

Addendum: The lines bleeding across pdf pages with wkpdf happened much less frequently when we fixed some page properties through css, e.g. height/width and font size.

Addendum 2: So a few changes. I am now generating pages straight from the print view on mediawiki. That works ok, and it means that any tweaks to MediaWiki:Print.css are also available to people printing the page otherwise, so it's good.

  • The disadvantage is that I can no longer adjust the section numbering to include the session number.

I had another go with phantomjs, with an adapted version of rasterize.js. Turns out there's a bug in phantomjs on OS X, which means that pages weren't rendered to pdf, but to huge images. So I'd initially discarded this. However, on Raspberry Pi it works.

  • It has the same (very rare) cut-off lines issue as wkpdf, so it's probably not down to those tools, but to the underlying rendering framework.
  • As for wkpdf, setting the font-size and page width was necessary to get the same font sizes across pages.
  • Because of the way phantomjs works with javascript, it's possible to add a header/footer: no more use of enscript to generate pages, and no need to merge those generated pages with the actual pages any more.
  • The biggest problems are:
    • The links do not come out as links, which is a pain, because it makes the document hard to use as a pdf (rather than just for print). This seems to be a bug in phantomjs.
    • The fonts don't work for some reason. All fonts are rendered as the default font, which does not look pretty. So that still needs working out.



2013-11-18