Tuesday, August 17, 2010

How do you know your PDF is correct?

This question comes up from time to time.  As far as I know, there is no programmer or application that can tell you whether your PDF is in fact a "correct" PDF.

Before we get into that, let's look at what "correct" means in this context.  PDF is both a language and a structured file format.  Unlike PostScript, which is also a language, PDF cannot be extended.  PostScript, a stack-based language in the style of Forth, allows you to define new language constructs and make use of them in your documents.

There have been all sorts of problems with PostScript in this regard - at least from the perspective of someone trying to make use of the PostScript content of documents.  It was easy to create convoluted, buggy ways to do things and one could never be sure if the constructs created really worked right.  A big issue with this was "page independence".  Programmers used to create PostScript files that, like programs, had to be executed sequentially, i.e., page by page, in order for the PostScript programming to work right.

PDF was meant to solve this - which it did by creating a language that could not be extended.  However, it also opened another can of worms.

The structure of a PDF file is organized loosely around that of a hierarchical database.  At the top of the tree you have the "Root".  Below the root you have indexes of pages.  Pages have elements like Resources.  Resources have things like fonts and images.  Most of the important parts of the file are organized around "dictionaries".
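As a rough illustration of that hierarchy, here is a minimal sketch that models the tree with plain Python dicts and walks from the Root down to the fonts on each page.  The `catalog` data and the `fonts_per_page` helper are made up for this example, and the model is simplified: a real PDF can nest its page indexes several levels deep and stores most of these objects as links rather than inline.

```python
# Hypothetical, simplified model of a PDF object tree using plain dicts.
# Key names mirror PDF conventions (/Pages, /Kids, /Resources, /Font).
catalog = {
    "Type": "Catalog",  # the "Root" of the document
    "Pages": {
        "Type": "Pages",  # index of pages
        "Kids": [
            {"Type": "Page",
             "Resources": {"Font": {"F1": {"BaseFont": "Helvetica"}}}},
            {"Type": "Page",
             "Resources": {"Font": {"F2": {"BaseFont": "Courier"}}}},
        ],
    },
}


def fonts_per_page(root):
    """Walk Root -> Pages -> Kids -> Resources -> Font, one list per page."""
    result = []
    for page in root["Pages"]["Kids"]:
        fonts = page.get("Resources", {}).get("Font", {})
        result.append(sorted(fonts))
    return result


print(fonts_per_page(catalog))  # [['F1'], ['F2']]
```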

A dictionary is a structure with keys and values:

  << /EntryA 1
     /EntryB 2
  >>

In this case there are two keys (EntryA and EntryB) and two values (1 and 2).  The keys are used to retrieve the values - so I can look up "EntryA" and get the value 1.

Using simple dictionaries doesn't present a problem.  But PDF uses dictionaries to hold other dictionaries, arrays, and other complex entities.  In addition PDF allows two dictionaries to share the same value.  So, for example, if a font is used on every page I don't have to duplicate the definition of the font; I merely store a link to a common definition in the dictionary.
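To make the sharing idea concrete, here is a small sketch, assuming a toy object table keyed by object number.  The `objects` table, the `("ref", n)` tuple used to stand in for a link, and the `resolve` helper are all inventions for this example; real PDF links also carry a generation number and are resolved through the file's cross-reference table.

```python
# Hypothetical object table: object number -> object.
objects = {
    1: {"Type": "Font", "BaseFont": "Helvetica"},            # the one shared font
    2: {"Type": "Page", "Resources": {"Font": {"F1": ("ref", 1)}}},
    3: {"Type": "Page", "Resources": {"Font": {"F1": ("ref", 1)}}},
}


def resolve(value):
    """Follow a ("ref", n) link into the object table; pass other values through."""
    if isinstance(value, tuple) and len(value) == 2 and value[0] == "ref":
        return objects[value[1]]
    return value


font_on_page_2 = resolve(objects[2]["Resources"]["Font"]["F1"])
font_on_page_3 = resolve(objects[3]["Resources"]["Font"]["F1"])

# Both pages end up at the very same font definition, stored only once.
print(font_on_page_2 is font_on_page_3)  # True
```

The point of the sketch is that a reader of the file has to follow links before it can say anything about a page, which is part of what makes checking a PDF harder than checking a flat file.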

So what does all this have to do with defining "correctness"?  While there are certainly many documents that define PDF, a lot of what has gone on over the years has left legacy issues open, i.e., things were done a certain way early on and never changed to reflect later changes to PDF.  Another big issue is that things like dictionaries in PDF have defined entries, e.g., in a Resource dictionary you have a /Font entry and that's where fonts for the page are found.  But PDF in general doesn't say anything about putting other things into these structures, e.g., I can add an /Elephants entry to the Resource dictionary and it will be ignored by most applications because they are not looking for it and don't care about it.
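One way a checker might at least surface such entries is to compare a dictionary's keys against the ones it expects.  The sketch below is only that, a sketch: the `unexpected_keys` helper is hypothetical, and the key set is my reading of the resource dictionary entries defined in the PDF 1.7 reference, so treat it as an assumption rather than an authoritative list.  Note also that finding an extra key does not by itself make the file wrong.

```python
# Entries assumed to be defined for a resource dictionary (per PDF 1.7).
KNOWN_RESOURCE_KEYS = {
    "ExtGState", "ColorSpace", "Pattern", "Shading",
    "XObject", "Font", "ProcSet", "Properties",
}


def unexpected_keys(resources):
    """Return the keys in a resource dictionary that are not in the known set."""
    return sorted(set(resources) - KNOWN_RESOURCE_KEYS)


page_resources = {"Font": {"F1": {"BaseFont": "Helvetica"}}, "Elephants": 42}
print(unexpected_keys(page_resources))  # ['Elephants']
```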

(Note: This is not an issue in file formats like AFP which are not based on a structured dictionary type of model.)

So what does it mean for a PDF to be correct?  The answer is I am not sure.  We always use a fairly complex test to determine "correctness".
