Hacking Anthologize PDF

With 0.6-alpha newly released, and various inquiries coming in from interested Anthologizers, it's probably about time to start thinking more about the output formats and how to customize them. The theory, after all, is that output formats are meant to be as hackable as a WordPress theme. Clearly, though, there are additional layers of complication since we're talking about multiple formats, not just HTML, and each of those formats carries its own quirks. PDF output, for example, carries a heaping helping of complications that HTML handles very differently (nowadays, in part, by the combination of HTML and CSS). But the principle remains the same. We want you to hack on and create new output formats, tailored to specific rhetorical situations, just as WP themes have done.

So, as a first step toward that goal, let's hack the existing PDF output.

The Core Sequence

First thing to get a handle on is the basic sequence of how your Anthologize project becomes a PDF.

The first step that happens for you is that the project is put into a (mostly) TEI document. We've added a few of our own namespaced elements, mostly to carry the options that you check on the export screens, but overall, it'll be recognizable as TEI. That's what you'll see if you export to Anthologize TEI.

But you need not be a TEI guru to work with things. We're working to build up an API to make life easier for you to put together the content you want. We're also developing an abstract class to help guide you through the process.

With that background, here's what's happening when you hit export to PDF. The next file that gets invoked is base.php:

$ops = array('includeStructuredSubjects' => false, //Include structured data about tags and categories
                'includeItemSubjects' => false, // Include basic data about tags and categories
                'includeCreatorData' => false, // Include basic data about creators
                'includeStructuredCreatorData' => false, //include structured data about creators
                'includeOriginalPostData' => true, //include data about the original post (true to use tags and categories)
                'checkImgSrcs' => true, //whether to check availability of image sources
                'linkToEmbeddedObjects' => true,
                'indexSubjects' => false,
                'indexCategories' => false,
                'indexTags' => false,
                'indexAuthors' => false,
                'indexImages' => false,
                );

$tei = new TeiDom($_SESSION, $ops);
$api = new TeiApi($tei);
$pdfer = new PdfAnthologizer($api);

$pdfer->output();

Skip over the $ops array for now, and you can see the sequence outlined above. First, an instance of TeiDom builds the Anthologize TEI. Then, pass that in to a new instance of TeiApi, and you have more tools to grab the content you want. PdfAnthologizer is the class that makes it easy to roll through the content and put it into the right place in the PDF. It extends a class called Anthologizer (includes/class-anthologizer.php), which is the one that's meant to help guide you through the process.

Anthologizer Class, and a boring extension of it

Let's start there, with Anthologizer, and work backwards. PdfAnthologizer extends it. The Anthologizer class is designed to give you methods to override or implement to help you roll through the content. You can see that in its $output property, the constructor, and the protected method writeItemContent.

public $output;

function __construct($api) {
        $this->api = $api;
        $this->init();
        $this->appendFront();
        $this->appendBody();
        $this->appendBack();
        $this->finish();
}

protected function writeItemContent($section, $partNo, $itemNo) {
        //override this method to do additional processing
        $html = $this->api->getSectionPartItemContent($section, $partNo, $itemNo);
        return $html;
}

All the other methods there are abstract. And so, a very simple -- and boring -- output that does nothing but output the first item in the first part of your project would look like:

class BoringAnthologizer extends Anthologizer {

  public function init() {
    //nothing to initialize before I start building the output
  }

  public function appendFront() {
    //I'm skipping the front matter. Currently, this is the Title Page, plus the Dedication and Acknowledgements from the export screens
  }

  public function appendBody() {
    //Parts and Items are 0-base indexed
    $this->output = $this->writeItemContent('body', 0, 0); //store on the property, so output() has something to echo
  }  

  public function appendBack() {
    //Skipping back matter, which might be indexes or other information auto-generated from the Anthologize TEI
  }

  public function finish() {
    //In this case, there's no post processing to do on the output.
  }

  public function output() {
    echo $this->output;
  }
}

Notice that this does not actually produce good HTML -- there's nothing that makes an <html>, <head>, or any other tag aside from what's in the HTML drawn from WordPress. It just pushes out the HTML that's in WordPress when you call output() on your instance of BoringAnthologizer. Adding the full HTML structure, here, would be jobs for init() and finish().
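To make that last point concrete, here's a rough sketch of a slightly less boring class that opens the document in init() and closes it in finish(). So that it runs outside WordPress, it leans on two stand-ins of my own: FakeApi (with canned content) and a trimmed copy of Anthologizer that follows the constructor shown above. The LessBoringAnthologizer name and the page title are made up, too.

```php
<?php
// Stand-in for TeiApi so the sketch runs outside WordPress. Only the method
// name comes from the article; the canned HTML is made up.
class FakeApi {
    public function getSectionPartItemContent($section, $partNo, $itemNo) {
        return '<p>Hello from part ' . $partNo . ', item ' . $itemNo . '</p>';
    }
}

// Trimmed stand-in for includes/class-anthologizer.php, following the
// constructor and writeItemContent() shown above.
abstract class Anthologizer {
    public $output;
    protected $api;

    function __construct($api) {
        $this->api = $api;
        $this->init();
        $this->appendFront();
        $this->appendBody();
        $this->appendBack();
        $this->finish();
    }

    abstract public function init();
    abstract public function appendFront();
    abstract public function appendBody();
    abstract public function appendBack();
    abstract public function finish();
    abstract public function output();

    protected function writeItemContent($section, $partNo, $itemNo) {
        return $this->api->getSectionPartItemContent($section, $partNo, $itemNo);
    }
}

class LessBoringAnthologizer extends Anthologizer {

    public function init() {
        // open the document before any content goes in
        $this->output = "<html>\n<head><title>My Project</title></head>\n<body>\n";
    }

    public function appendFront() {
        // still skipping the front matter
    }

    public function appendBody() {
        // still only the first item of the first part
        $this->output .= $this->writeItemContent('body', 0, 0) . "\n";
    }

    public function appendBack() {
        // still skipping the back matter
    }

    public function finish() {
        // close what init() opened
        $this->output .= "</body>\n</html>\n";
    }

    public function output() {
        echo $this->output;
    }
}

$htmler = new LessBoringAnthologizer(new FakeApi());
$htmler->output();
```

The point is only the division of labor: init() sets up $output, the append* methods add to it, and finish() wraps things up.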

PDF output is clearly much more funky. Anthologize ships with the TCPDF library for creating PDFs. Thus, the $output property is an instance of TCPDF (actually, a subclass of TCPDF from that library), with lots of the configuration of it done in init(). Building the Table of Contents with TCPDF can naturally only happen once the document is built, so that happens in finish(). Other than that, we're rolling through the append* methods to stuff data in, using TCPDF methods. Mostly.

Quick Recipes for Hacking PdfAnthologizer

Here are recipes for two quick little hacks on the PDF output. You could do these either by directly changing the PdfAnthologizer class, or subclassing it. If you go the subclassing route, make sure you change base.php to use your new class:

$pdfer = new MyPdfAnthologizer($api);

First, let's change the logo that appears in the header on pages from the Anthologize logo to your logo.

This is easy, since PdfAnthologizer just slaps that into additional properties:

public $headerLogo = 'logo-pdf.png'; //should be in /anthologize/images
public $headerLogoWidth = '10';

Change the filename for $headerLogo, make sure it's in /anthologize/images, and that's it! You'll want to tinker with $headerLogoWidth to make it conform to your standards of prettiness of design. Fortunately for me, my standards along those lines are pretty weak, so I just make two or three guesses and don't worry about it more than that.
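If you'd rather subclass than edit PdfAnthologizer directly, the logo hack might look something like this. The PdfAnthologizer here is a bare stand-in so the snippet runs by itself (the real class takes the $api in its constructor and does far more), and my-logo.png and the width of 14 are placeholder values.

```php
<?php
// Bare stand-in for PdfAnthologizer. Only these two properties come from
// the real class as quoted above.
class PdfAnthologizer {
    public $headerLogo = 'logo-pdf.png'; //should be in /anthologize/images
    public $headerLogoWidth = '10';
}

// Redeclaring the properties in a subclass overrides the defaults.
class MyPdfAnthologizer extends PdfAnthologizer {
    public $headerLogo = 'my-logo.png'; // hypothetical file, also in /anthologize/images
    public $headerLogoWidth = '14';     // a guess; tinker to taste
}

$pdfer = new MyPdfAnthologizer();
echo $pdfer->headerLogo . ' at width ' . $pdfer->headerLogoWidth . "\n";
```

With the real plugin you'd drop the stand-in and keep passing $api to the constructor, as in base.php.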

Want to have more fun with the header? You can do that by looking a bit more closely into the appendPart() and appendItem() methods. The thing to look for is an additional method in PdfAnthologizer called set_header(). PdfAnthologizer makes the headers by showing the part title and the item title for each page. To do that, appendPart() digs up the info and calls set_header():

$titleNode = $this->api->getSectionPartTitle($section, $partNo, true);
$title = isset( $titleNode->textContent ) ? $titleNode->textContent : '';

$firstItemNode = $this->api->getSectionPartItemTitle($section, $partNo, 0, true);
$string = isset( $firstItemNode->textContent ) ? $firstItemNode->textContent : false;
   
$this->set_header(array('title'=>$title, 'string'=>$string));

In the array passed to set_header, TCPDF understands 'title' as the top line and 'string' as the bottom line (but more on that later).

appendItem() similarly calls set_header() to reset the 'string':

$titleNode = $this->api->getSectionPartItemTitle($section, $partNo, $itemNo, true);
$title = isset( $titleNode->textContent ) ? $titleNode->textContent : '';
$this->set_header(array('string'=>$title));

And so, you have many hacking options available. Just hack and modify the text at will. (We'll see more options for that when we move into the TeiApi possibilities.)
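As one example of modifying the text, here's a sketch of a subclass that trims overlong header text before handing it on. The stand-in set_header() below only remembers what it's given; the real one passes the values along to TCPDF, and I'm assuming it accepts a partial array the way the calls above suggest. The $headerData and $maxHeaderLength properties are my own inventions.

```php
<?php
// Stand-in for PdfAnthologizer: this set_header() just stores the values
// so the sketch can run by itself.
class PdfAnthologizer {
    public $headerData = array('title' => '', 'string' => '');

    public function set_header($array) {
        $this->headerData = array_merge($this->headerData, $array);
    }
}

class MyPdfAnthologizer extends PdfAnthologizer {
    public $maxHeaderLength = 40; // hypothetical property

    public function set_header($array) {
        foreach ($array as $key => $text) {
            // 'string' can be false when a part has no first item, so only
            // touch real strings. strlen()/substr() count bytes; titles with
            // multi-byte characters would want the mb_* functions instead.
            if (is_string($text) && strlen($text) > $this->maxHeaderLength) {
                $array[$key] = substr($text, 0, $this->maxHeaderLength - 3) . '...';
            }
        }
        parent::set_header($array);
    }
}

$pdfer = new MyPdfAnthologizer();
$pdfer->set_header(array(
    'title'  => 'Part One',
    'string' => 'A very long item title that would otherwise run off the edge of the page',
));
echo $pdfer->headerData['title'] . "\n";
echo $pdfer->headerData['string'] . "\n";
```

Because appendPart() and appendItem() both go through set_header(), overriding that one method catches the text from both places.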

Overriding TCPDF

While we're on the subject of the header, this is a good time to look at how and why Anthologize overrides TCPDF with /pdf/class-anthologize-tcpdf.php.

That class overrides three methods of TCPDF: addTOC(), Header(), and Footer(). It does so in a fairly lazy way, mostly copying directly from the original methods and making a few minor adjustments. I'm still fairly weak with the mysteries of TCPDF, to be honest, so there's a lot there that I don't understand, but tinkering and working from TCPDF's own examples is a great place to start.

For a quick stab into the guts, here are the lines there that set the 'title' and 'string':

$headerdata = $this->getHeaderData();

//snip

// header title
$this->SetFont($headerfont[0], 'B', $headerfont[2] + 1);
$this->SetX($header_x);
$this->Cell($cw, $cell_height, $headerdata['title'], 0, 1, '', 0, '', 0);

// header string
$this->SetFont($headerfont[0], $headerfont[1], $headerfont[2]);
$this->SetX($header_x);
$this->MultiCell($cw, $cell_height, $headerdata['string'], 0, '', 0, 1, '', '', true, 0, false, true, 0, 'T', false);

Yeah, clearly there are many variables and parameters running around. But the key idea is that overriding TCPDF to make your own Header template is built into TCPDF's design. The set_header() method in PdfAnthologizer discussed above invokes TCPDF's setHeaderData() method, so this is an extra layer in how you can work with the header.

Similarly for the Footer and Table Of Contents. Anthologize overrides the Footer just to include only the current page number. TCPDF by default displays the "page X / total pages" format. If that's what you prefer, just erase the Footer method in AnthologizeTCPDF, or treat it to your taste in your own override of TCPDF.
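If you'd rather write your own footer than fall back on TCPDF's default, a sketch along the lines of TCPDF's own custom header/footer example could look like this. SetY(), SetFont(), Cell(), getAliasNumPage() and getAliasNbPages() are real TCPDF methods; the PagedFooterTCPDF name, the font choice, and the "Page X / Y" wording are mine. The little TCPDF stand-in at the top only kicks in when the real library isn't loaded, so the snippet can run by itself.

```php
<?php
if (!class_exists('TCPDF')) {
    // Tiny stand-in, used only when the real TCPDF library isn't loaded.
    // It records what would have been drawn instead of drawing it.
    class TCPDF {
        public $cells = array();
        public function SetY($y) {}
        public function SetFont($family, $style = '', $size = null) {}
        public function Cell($w, $h = 0, $txt = '') { $this->cells[] = $txt; }
        public function getAliasNumPage() { return '{n}'; } // placeholder text
        public function getAliasNbPages() { return '{N}'; } // placeholder text
    }
}

// In the plugin you would more likely extend AnthologizeTCPDF, to keep its
// Header() and addTOC() overrides.
class PagedFooterTCPDF extends TCPDF {

    public function Footer() {
        // position 15 units up from the bottom of the page
        $this->SetY(-15);
        $this->SetFont('helvetica', 'I', 8);
        // the aliases are swapped for real numbers when the PDF is finished
        $text = 'Page ' . $this->getAliasNumPage() . ' / ' . $this->getAliasNbPages();
        $this->Cell(0, 10, $text, 0, 0, 'C');
    }
}
```

TCPDF calls Footer() itself for each page, so there's nothing more to wire up beyond making sure this is the class that gets instantiated.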

And that leads us back to the PdfAnthologizer class. Remember, that's the one with init(), appendPart(), appendItem(), and other methods designed to help you roll through the content. We were hacking on that one above when we looked at set_header(). Our BoringAnthologizer class would be a sibling of it (both extend Anthologizer). We've looked briefly at the appendPart() and appendItem() methods. In PdfAnthologizer's init() method, notice how we start up the PDF generation process:

$this->output = new AnthologizeTCPDF(PDF_PAGE_ORIENTATION, PDF_UNIT, $page_size, true, 'UTF-8', false); //overrides TCPDF

So, if you want to do more extreme redesigning of things like the Header, Footer, or Table of Contents for your PDF, you're working in TCPDF-land, overriding TCPDF to your taste. Then, in your hack or override to PdfAnthologizer, you'd make sure that that's what you fire up in its init() method.

General rules for understanding how to hack or override PdfAnthologizer

We're not quite as loopy as WordPress, but we are kinda loopy

When PdfAnthologizer (or any other class that extends Anthologizer) starts up, it does the following, taking a TeiApi object as its constructor parameter:

  1. Initialize output, which is a public property. See init()
  2. Append front matter. It's up to you to implement how that works. See appendFront()
  3. Append the body, the content of your project. Usually loops through Parts in your implementation. See appendBody()
    1. Append Parts. Again up to you to implement, but usually loops through Items. See appendPart()
    2. Append Items. Another day, another loop. Usually uses writeItemContent() to get the HTML. See appendItem()
  4. Append the back. In theory will follow the same pattern, but not yet well developed in Anthologize
  5. Finalize anything that needs to be wrapped up
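
Stripped down to plain text output, the sequence above can be sketched like so. Everything here is a toy: FakeApi stands in for TeiApi, and while getSectionPartItemContent() is quoted earlier, the two count methods are my guesses at names and may not match the real API. LoopyAnthologizer copies the Anthologizer constructor rather than extending the real class, just to keep things short.

```php
<?php
// Fake API backed by an array, so the loops can run without WordPress.
// The two count methods are assumed names, not confirmed TeiApi methods.
class FakeApi {
    private $parts = array(
        array('First post', 'Second post'),
        array('Third post'),
    );

    public function getSectionPartCount($section) {
        return count($this->parts);
    }

    public function getSectionPartItemCount($section, $partNo) {
        return count($this->parts[$partNo]);
    }

    public function getSectionPartItemContent($section, $partNo, $itemNo) {
        return '<p>' . $this->parts[$partNo][$itemNo] . '</p>';
    }
}

class LoopyAnthologizer {
    public $output;
    protected $api;

    function __construct($api) {
        $this->api = $api;
        $this->init();        // 1
        $this->appendFront(); // 2
        $this->appendBody();  // 3
        $this->appendBack();  // 4
        $this->finish();      // 5
    }

    public function init() { $this->output = ''; }

    public function appendFront() { $this->output .= "[front matter]\n"; }

    public function appendBody() {
        // loop through Parts
        $partCount = $this->api->getSectionPartCount('body');
        for ($partNo = 0; $partNo < $partCount; $partNo++) {
            $this->appendPart('body', $partNo);
        }
    }

    public function appendPart($section, $partNo) {
        $this->output .= 'Part ' . ($partNo + 1) . "\n";
        // loop through Items
        $itemCount = $this->api->getSectionPartItemCount($section, $partNo);
        for ($itemNo = 0; $itemNo < $itemCount; $itemNo++) {
            $this->appendItem($section, $partNo, $itemNo);
        }
    }

    public function appendItem($section, $partNo, $itemNo) {
        $this->output .= '  ' . $this->writeItemContent($section, $partNo, $itemNo) . "\n";
    }

    public function appendBack() { $this->output .= "[back matter]\n"; }

    public function finish() {
        // nothing to wrap up in this toy version
    }

    protected function writeItemContent($section, $partNo, $itemNo) {
        return $this->api->getSectionPartItemContent($section, $partNo, $itemNo);
    }
}

$loopy = new LoopyAnthologizer(new FakeApi());
echo $loopy->output;
```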

The usual pattern to watch for in the append* methods

  1. Grab data from $this->api
  2. Modify the data to taste
  3. Add that data into $this->output
  4. Add whatever additional content you want

So, for example, some document metadata for the PDF is created by these lines in appendFront():

//title and author
$creator = $this->api->getProjectCreator(false, false);
$book_title = $this->api->getProjectTitle(true);
$this->output->SetCreator("Anthologize: A One Week | One Tool Production");
$this->output->SetAuthor($creator);
$this->output->SetTitle($book_title);

And these lines add the actual title page:

//append cover
$this->output->AddPage();

$this->frontPages++; //keeps track of pages in the front matter to remove them from the page numbering in the TOC

$this->output->SetY(80);
$this->output->Write('', $book_title, '', false, 'C', true );
$this->output->setFont($this->font_family, '', $this->baseH);

$this->output->Write('', $creator, '', false, 'C', true );
$this->output->SetY(120);
$year = substr( $this->api->getProjectPublicationDate(), 0, 4 );
$this->output->Write('', $this->api->getProjectCopyright(false, false) . ' -- ' . $year , '', false, 'C', true );
$this->output->EndPage();

See the pattern? This is an instance of PdfAnthologizer. Grab $creator and $book_title from $this->api. $this->output is an instance of TCPDF (overridden by AnthologizeTCPDF), so use the methods of that class to add to the output.

Grabbing Content with the TeiApi Class

So about that TeiApi class. There's a lot going on, and it is in flux, and there are many quirks due to how complicated things like authorship are in this environment. Ask your friendly neighborhood metadata librarian about the different varieties of "creator" floating around a complex document like this, especially with the twist of some creators being users in the WordPress site, and some not (e.g., in the case of importing content via RSS feed).

We'll be sorting through and developing that API, hopefully with much feedback from people like you who want to hack on the export formats or create your own. A set of real unit tests and documentation is part of our plans.

© Patrick Murray-John. All content is CC-BY. Drupal theme by Kiwi Themes.