Thursday, August 19, 2010

How do you know your PDF is correct? (cont.)

I guess over time I have come to adopt the PDF correctness model that says "if the PDF is generally correct, i.e., it works in a variety of applications across versions, operating systems, and processors, and it works correctly for a given application, then we say the PDF is in fact correct."

There are a few PDF validating tools out there you can buy.  Some large printing houses have them to validate their internal workflow or external PDF inputs (like ads in a newspaper).  But these are of little use if the PDF that comes out doesn't work or if you don't have the money to buy one.

We have used this type of model for more than a decade and it has always worked well.

Periodically we come across new vendors or vendors updating their PDF output and we run into interesting problems.  Recently a customer came to me quite panicked and said "some of my output is missing".

This is the second cardinal sin of workflow - having an application make some output that is supposed to be on the page go away.  (The first cardinal sin is having the wrong output on the page.)

So I took a look at the PDF.  Most PDFs have an "Info" section containing strings such as the authoring application's name, a time stamp, etc.  Well, lo and behold, way down deep in the resources for the page in question there was a color space definition.  (A color space definition says things like "CS1" means to use calibrated RGB for all colors marked by the CS1 tag.)  Well, here was the name "CS1", and its definition, instead of being a legal PDF color space, was a link to the Info section of the PDF document.

Our application, in this case pdfExpress XM, iterates through the color spaces because it might want to change them.  As it does so it checks to see if each color space is one it cares about.  Part of this checking is to verify that the color space conforms to the PDF standard for color spaces.  When it found this particular color space definition it generated an error.  Unfortunately, several levels up in the code we had made the assumption that the return value was always correct, and we placed the page contents based on this.  When this error occurred we branched around the code that injected content into the page, and so that part of the page was blank.

So in this particular case it's easy to say "the PDF is wrong" - which it is.  But our handling of the bad PDF was also a problem and the customer was unhappy.
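
To make the failure mode concrete, here is a minimal sketch in Python.  It is not the pdfExpress XM code; the page is modeled as a plain Python dict, and the names is_legal_color_space and process_page are made up for illustration.  The check itself is simplified (it only looks at the color space family name).  The point it tries to show is that a bad color space gets reported and skipped, while the code that places content on the page still runs.

```python
# Color space family names defined by the PDF standard (simplified check:
# we only look at the family name, not at the parameters that follow it).
KNOWN_FAMILIES = {
    "DeviceGray", "DeviceRGB", "DeviceCMYK", "CalGray", "CalRGB", "Lab",
    "ICCBased", "Indexed", "Pattern", "Separation", "DeviceN",
}


def is_legal_color_space(cs):
    """A color space is a name, or an array whose first element is a name."""
    if isinstance(cs, str):
        return cs in KNOWN_FAMILIES
    if isinstance(cs, list) and cs and isinstance(cs[0], str):
        return cs[0] in KNOWN_FAMILIES
    return False


def process_page(page):
    """Return (output, warnings). A bad color space is reported, not fatal."""
    warnings = []
    for name, cs in page.get("ColorSpace", {}).items():
        if not is_legal_color_space(cs):
            warnings.append(name + ": not a legal color space, left unchanged")
            continue
        # A real tool would decide here whether to change the color space.

    # Content placement does not depend on what the loop above found.
    output = list(page.get("Content", [])) + ["injected content"]
    return output, warnings


# Hypothetical page: CS1 points at something shaped like the Info section.
page = {
    "ColorSpace": {
        "CS0": "DeviceRGB",
        "CS1": {"Producer": "SomeApp", "CreationDate": "D:20100819"},
    },
    "Content": ["original content"],
}
output, warnings = process_page(page)
print(output)
print(warnings)
```

Whether to skip, warn, or stop on a bad entry is a policy decision for each application; the sketch only shows that the decision should be made on purpose rather than falling out of an unchecked return value.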


The real question here is: as a PDF application developer, how do you anticipate arbitrarily wrong PDF input?

The portion of the PDF in question is basically a dictionary (as we discussed a few days ago) where the value of one entry is incorrect.  This is many levels down in a structure which is otherwise correct.  While it's fairly easy to check for what should be there, handling the cases where something is there that's not supposed to be is not.  For example, most values in a PDF, including dictionary entries, can be indirect.  This means that instead of an actual value being present there is a pointer to some object.  Normally you have to locate this object and then examine it as if it were the entry.
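
Here is a rough sketch, again in Python, of what "locate the object and examine it as if it were the entry" can look like.  Everything in it is a toy model I made up for illustration: Ref stands in for an indirect reference, objects stands in for the table of numbered objects in the file, and resolve and classify_color_space are hypothetical helper names.  The sketch treats a dangling reference or a reference loop as "nothing there" (None), and only then looks at the shape of what it found.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Ref:
    """Toy stand-in for an indirect reference (object number, generation)."""
    num: int
    gen: int = 0


def resolve(value, objects, max_hops=32):
    """Follow indirect references until a direct value is reached.

    Returns None for a dangling reference, a reference loop, or a chain
    longer than max_hops.
    """
    seen = set()
    while isinstance(value, Ref):
        if value in seen or len(seen) >= max_hops:
            return None
        seen.add(value)
        value = objects.get(value)  # missing object -> None
    return value


def classify_color_space(entry, objects):
    """Resolve the entry first, then check only its general shape."""
    cs = resolve(entry, objects)
    if isinstance(cs, str):
        return "name"
    if isinstance(cs, list) and cs and isinstance(resolve(cs[0], objects), str):
        return "array"
    return "invalid (" + type(cs).__name__ + ")"


# Hypothetical object table.
objects = {
    Ref(1): {"Producer": "SomeApp"},                           # Info-like dict
    Ref(2): ["CalRGB", {"WhitePoint": [0.9505, 1.0, 1.089]}],  # a color space
    Ref(3): Ref(3),                                            # points at itself
}

for entry in ("DeviceRGB", Ref(2), Ref(1), Ref(3), Ref(9)):
    print(entry, "->", classify_color_space(entry, objects))
```

In this toy model the bad "CS1" entry from the story would look like Ref(1): it resolves without any trouble, and only the shape check after resolution shows that a dictionary is sitting where a name or an array should be.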
