Thursday, August 19, 2010

How do you know your PDF is correct? (cont.)

I guess over time I have come to adopt the PDF correctness model that says "if the PDF is generally correct, i.e., it works in a variety of applications across versions, operating systems, and processors, and it works correctly for a given application, then we say the PDF is in fact correct."

There are a few PDF validating tools out there you can buy.  Some large printing houses have them to validate their internal workflow or external PDF inputs (like ads in a newspaper).  But these are of little use if a PDF that passes validation still doesn't work in practice, or if you don't have the money to buy one.

We have used this type of model for more than a decade and it has always worked well.

Periodically we come across new vendors or vendors updating their PDF output and we run into interesting problems.  Recently a customer came to me quite panicked and said "some of my output is missing".

This is the second cardinal sin of workflow - having an application make some output that is supposed to be on the page go away.  (The first cardinal sin is having the wrong output on the page.)

So I took a look at the PDF.  All PDFs have an "Info" section where you have a string containing the authoring application's name, a time stamp, etc.  Well, lo and behold, way down deep in the resources for the page in question there was a color space definition.  (A color space definition says things like "CS1" means to use calibrated RGB for all colors marked by the CS1 tag.)  Well, here was the name "CS1", and its definition, instead of being a legal PDF color space, was a link to the Info section of the PDF document.

Our application, in this case pdfExpress XM, iterates through the color spaces because it might want to change them.  As it does so it checks to see if each color space is one it cares about.  Part of this checking is to verify whether or not it conforms to the PDF standard for color spaces.  When it found this particular color space definition it generated an error.  Unfortunately, several levels up in the code we made the assumption that the return value was always correct and we placed the page contents based on this.  When this error occurred we branched around the code that injected content into the page, and so that part of the page was blank.

So in this particular case it's easy to say "the PDF is wrong" - which it is.  But our handling of the bad PDF was also a problem, and the customer was unhappy.


The real question here is: as a PDF application developer, how do you anticipate arbitrarily wrong PDF input?

The portion of the PDF in question is basically a dictionary (as we discussed a few days ago) where the value for one of the entries is incorrect.  This is many levels down in a structure which is otherwise correct.  While it's fairly easy to check for what should be there, handling something that's not supposed to be there is not.  For example, most places in PDF, including dictionary entries, can be indirect.  This means that instead of an actual value being present there is a pointer to some object.  Normally you have to locate this object and then examine it as if it were the entry.
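To make that concrete, here is a minimal sketch in Python of that kind of defensive lookup.  The object store and helper names are made up for illustration - they are not pdfExpress XM's internals or any real PDF library's API:

  # Hypothetical sketch of defensively reading a /ColorSpace resource entry.
  # The object store and helpers are illustrative only.

  MAX_INDIRECTION = 32   # guard against circular references in damaged files

  def resolve(objects, value):
      """Follow indirect references (pointers to other objects) to the real value."""
      for _ in range(MAX_INDIRECTION):
          if isinstance(value, tuple) and value[0] == "ref":
              value = objects.get(value[1])     # locate the referenced object
          else:
              return value
      raise ValueError("indirect reference chain too long (probably circular)")

  def looks_like_color_space(value):
      """Rough shape check: a name like /DeviceRGB, or an array like [/ICCBased ...]."""
      if isinstance(value, str):
          return value.startswith("/")
      return (isinstance(value, list) and value
              and isinstance(value[0], str) and value[0].startswith("/"))

  # Toy object store mirroring the broken file above: object 7 is the document
  # Info dictionary, and the page's CS1 entry wrongly points at it.
  objects = {
      7: {"/Producer": "SomeApp 1.0"},     # the Info dictionary
      9: {"/CS1": ("ref", 7)},             # the page's /ColorSpace resources
  }

  for name, raw in objects[9].items():
      value = resolve(objects, raw)
      if not looks_like_color_space(value):
          # Report it and leave the entry alone - but never skip placing the
          # rest of the page's content because of it.
          print(f"warning: {name} does not resolve to a color space: {value!r}")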

Wednesday, August 18, 2010

How do you know your PDF is correct? (cont.)

While there are tools available to purchase in this regard, their use does not ensure that a PDF will work in practice.  Each version of PDF enables specific features, e.g., transparency, replaces other features with alternate constructs, and so on.

So the first thing you have to figure out is which version you are trying to create.  In my experience you always want to check that the PDF is compatible with the oldest version that supports all the features you need.  You can check for correctness relative to newer versions as well, but this limits the usefulness of the PDF.  Of course, "A" list companies always want to force your PDFs to be the latest, most complex version - but that's not always the best for you or your customer or your application.

Once you have decided on a PDF version, the simplest way to "validate" a PDF is to use a variety of applications to process it and see if the results are correct.  For RIPs and viewers this basically means processing the PDF and checking the output and logs.   We tend to use a spectrum of newer and older tools, RIPs and applications for this.  The reasoning is that if it opens and works in older tools as well as newer tools the PDF is much more likely to be right.
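In practice this can be as simple as scripting a couple of renderers over the same file and comparing what they report.  A rough Python sketch, assuming Ghostscript and Poppler's pdftoppm happen to be on the PATH - substitute whatever RIPs and viewers your shop already trusts:

  # Rough sketch: render the same PDF with two different tools and flag
  # disagreements.  Assumes gs and pdftoppm are installed and on the PATH.
  import pathlib
  import subprocess
  import sys
  import tempfile

  def render(cmd):
      """Run a renderer and return its exit code plus combined log output."""
      proc = subprocess.run(cmd, capture_output=True, text=True)
      return proc.returncode, proc.stdout + proc.stderr

  def check(pdf):
      with tempfile.TemporaryDirectory() as tmp:
          out = pathlib.Path(tmp)
          results = {
              "ghostscript": render(["gs", "-dBATCH", "-dNOPAUSE", "-dSAFER",
                                     "-sDEVICE=png16m", "-r150",
                                     "-sOutputFile=" + str(out / "gs-%03d.png"), pdf]),
              "pdftoppm": render(["pdftoppm", "-r", "150", "-png", pdf, str(out / "pp")]),
          }
      for tool, (code, log) in results.items():
          # Crude heuristic: a non-zero exit or an "error" in the log is suspect.
          status = "ok" if code == 0 and "error" not in log.lower() else "SUSPECT"
          print(f"{tool}: {status}")

  if __name__ == "__main__":
      check(sys.argv[1])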

Our tools have been in operation and continuous customer use for almost a decade at this point so we only contend with new "features" for the most part.

Backward compatibility is also important, and in general we tend not to add features just because we can.  Why?  Mostly because the customers who use our products have their own set of tools which they have been using a long time and don't want to have to re-verify that any changes we have made do not negatively impact their workflow.

When you think about all of this together you start to see that there really isn't such a thing as "correct PDF" because that depends on the application and usage.  I can continually update my PDF output modules but I may break customer workflows by doing so.

Tuesday, August 17, 2010

How do you know your PDF is correct?

This question comes up from time to time.  As far as I know there is no program or application which can tell you if your PDF is in fact "correct" PDF.

Before we get into that let's look into what "correct" means in this context.  PDF is both a language and a structured file format.  Unlike PostScript, which is also a language, PDF cannot be extended.  PostScript, based on the Forth programming language, allows you to define new language constructs and make use of them in your documents.

There have been all sorts of problems with PostScript in this regard - at least from the perspective of someone trying to make use of the PostScript content of documents.  It was easy to create convoluted, buggy ways to do things and one could never be sure if the constructs created really worked right.  A big issue with this was "page independence".  Programmers used to create PostScript files that, like programs, had to be executed sequentially, i.e., page by page, in order for the PostScript programming to work right.

PDF was meant to solve this - which it did by creating a language that could not be extended.  However, it also opened another can of worms.

The structure of a PDF file is organized loosely around that of a hierarchical database.  At the top of the tree you have the "Root".  Below the root you have indexes of pages.  Pages have elements like Resources.  Resources have things like fonts and images.  Most of the important parts of the file are organized around "dictionaries".

A dictionary is a structure with keys and values:

  << /EntryA 1
     /EntryB 2
  >>

In this case there are two keys (EntryA and EntryB) and two values (1 and 2).  The keys are used to retrieve the values - so I can lookup "EntryA" and get the value 1.

Using simple dictionaries doesn't present a problem.  But PDF uses dictionaries to hold other dictionaries, arrays, and other complex entities.  In addition PDF allows two dictionaries to share the same value.  So, for example, if a font is used on every page I don't have to duplicate the definition of the font; I merely store a link to a common definition in the dictionary.

So what does all this have to do with defining "correctness?"  While there are certainly many documents that define PDF, a lot that has gone on over the years has left legacy issues open, i.e., things were done a certain way early on and never changed to reflect changes to PDF.  Another big issue is that things like dictionaries in PDF have defined entries, i.e., in a Resource dictionary you have a /Font entry and that's where fonts for the page are found.  But PDF in general doesn't say anything about putting other things into these structures, e.g., I can add an /Elephants entry to the Resource dictionary and it will be ignored by most applications because they are not looking for it and don't care about it.
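A toy model in Python (not real PDF syntax) shows both ideas - one font definition shared by two pages through a reference, and an extra entry that a well-behaved consumer simply never looks at:

  # Toy illustration: two pages share one font object by reference, and an
  # unexpected /Elephants entry is ignored by a consumer that only looks up
  # the keys it knows about.

  font_f1 = {"/Type": "/Font", "/BaseFont": "/Helvetica"}   # one shared definition

  page1_resources = {"/Font": {"/F1": font_f1}, "/Elephants": "trunk"}
  page2_resources = {"/Font": {"/F1": font_f1}}             # same object, not a copy

  def fonts_for(resources):
      # A typical consumer reads the entries it cares about and ignores the rest.
      return resources.get("/Font", {})

  for resources in (page1_resources, page2_resources):
      for name, font in fonts_for(resources).items():
          print(name, font["/BaseFont"], "shared:", font is font_f1)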

(Note: This is not an issue in file formats like AFP which are not based on a structured dictionary type of model.)

So what does it mean for a PDF to be correct?  The answer is I am not sure.  We always use a fairly complex test to determine "correctness".

Thursday, August 12, 2010

Debugging Color...

I've had some interesting experiences in this regard - particularly with those unskilled in color.

First off - what do I mean by "debugging color"?  In large complex systems color failures do not occur on a screen where a message box pops up and says "sorry your job failed".  In these systems things are put together in such a way as to work 24x7.  They process standard inputs and produce standard outputs - all tested and carefully organized.  So a "failure" in this context occurs only when someone actually notices the problem.

What do I mean by this?

Large systems are typically built to be, at least to a certain extent, fault tolerant in the sense that at various steps along the way checks are performed to ensure what is supposed to be going on is actually going on.  Jobs that have various incorrect elements "pop out" in fail folders and humans collect them, diagnose them, and resubmit them for processing (presumably after fixing the problems).

Color problems do not manifest themselves in quite the same way as, say, a missing database entry for a mail merge job.  In this latter case a logical integrity check says - "hmmm - there is no entry for this" - and the job stops processing.

If the "color is wrong" what happens?  First off - there are no systems along the way to check.  Maybe that logo is now red and not blue.  The software cannot not know this because there is nothing to check it against.  Secondly, something - a human or a machine - has to look at the result of processing the entire job - usually by looking at what's in an output bin - before the problem is identified.  So, after maybe two to four hours of processing through five different servers and ten different processes we discover something is wrong.

So the night operator, whose job it is to run the jobs nightly, notices that something doesn't look like it's supposed to.  In the production world people are trained to notice things that don't look the same as everything else.  In this case the job doesn't look like it did yesterday (or, if you're lucky, there's a sample to compare it to).  The logo is a different color or shade.

So at this point the operator probably does not know what is wrong - if anything.  The system that provides working job samples for comparison might be broken, i.e., the operator did not get an updated production sample, or the output might be wrong, or the device producing it might be working incorrectly.

Generally an operator will be able to identify problems related to things under their control, e.g., whether other jobs are producing wrong colors.  So generally the problem does not escalate unless these "standard" sorts of issues get ruled out.

So now the problem has "escalated" to the supervisor.  At this point things generally become more interesting.  Someone has to decide if the output is actually "wrong" or if the criteria used to judge it are incorrect.  As remarkable as it seems, often no one is able to make a precise diagnosis.

The reasons are very interesting.  Typically in a large shop the world of operations, job preparation, QA, graphic arts, and customer service all fall into different silos.  Each silo has its own managers, contractors, vendors, processes, and so forth.

Now the job has to "work backwards" through the system.  At each step those involved (from whatever silo is responsible) have to look at the inputs they received and compare them to the incorrect output.  In the world of color things get more interesting than in the case of, say, a data value, i.e., a bad mailing address.

Each silo has its own idea of how color is handled and measured.  Some don't really know anything about it other than how it "looks".  Others, like CSRs, tend to think about it in terms of "approved color" or "not approved color".  Programmers tend to think about the numbers in the code that produced the values.

So as the job works backwards each silo applies its own criteria to the output and decides if it's responsible for what it sees.  Generally this occurs in an ad hoc meeting on the shop floor where fingers are pointed.

(This will be continued tomorrow on the AFP Outsider blog as it applies to both PDF and AFP workflows.)

Wednesday, August 11, 2010

PDF Size and Performance...

This is a topic which comes up frequently (and no, this is not Viagra spam).

People say "This PDF is too large and it will take to long to RIP."   Basically most people make a direct link between the size of the PDF and the performance they are going to get RIPing the file.

In general this is wrong for several reasons.  First of all, until very recently PDF was always turned into PostScript before rasterizing on a RIP.  Now PostScript is a programming language, which means that the RIP must process the language statements in order to create the raster.  All this takes time - especially when you have a lot of nested PDF forms.  So any PDF file would effectively be processed twice - once to convert the PDF to PostScript, and again to process the PostScript.

There isn't a one-to-one correspondence between PDF operators and PostScript operators, particularly in terms of complexity, so seemingly simple and short PDF might not be simple PostScript as far as the rasterizer is concerned.

PDFs can be very large due to embedded images.  The most profound effect I have seen on performance (and I mostly work on non-plating devices) is extra resolution, i.e., too many bits for the job.  Sheer volume is the first problem - a 2,400 dpi CMYK image takes a long time for the RIP to consume because there are a lot of bytes.  If you only need 600 dpi then don't put a 2,400 dpi image into the file.  RIPs process images quickly but can be overwhelmed by sheer volume.
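The arithmetic makes the point.  A quick back-of-the-envelope in Python (uncompressed sizes, 8 bits per channel, a full 8.5 x 11 inch image assumed):

  # Back-of-the-envelope: uncompressed size of a full-page CMYK image at
  # different resolutions (8 bits per channel, 8.5 x 11 inch image assumed).
  def image_bytes(width_in, height_in, dpi, channels=4, bits_per_channel=8):
      pixels = (width_in * dpi) * (height_in * dpi)
      return pixels * channels * bits_per_channel / 8

  for dpi in (300, 600, 2400):
      print(f"{dpi:>5} dpi: {image_bytes(8.5, 11, dpi) / 2**30:5.2f} GiB")

  # 2400 dpi carries 16x the data of 600 dpi - all of which the RIP has to
  # read, decompress, and scale, even if the device only images at 600 dpi.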

So even a file with lots of images may still RIP quickly, as long as the images carry only the resolution the job actually needs.

Font usage is a weakness in many RIPs - particularly the use of a lot of fonts.   There are many cases of PDFs having a completely new set of fonts on each page in, say, a thousand page document.  RIPs don't deal well with this and for me it's been a problem over the years.  This issue compounds another common font issue - bitmap fonts.  Many programs, particularly those that convert to PDF, tend to convert source bitmap fonts to PDF bitmap fonts.  My experience is that the higher the percentage of bitmap fonts in a file, the slower, in general, it will RIP as page count goes up.

Applications go to great lengths to obfuscate fonts so that font licensors can't have their intellectual property stolen.  Unfortunately you may be paying the price for this with crappy RIP performance - so you're paying twice - once for the font if you need it and again to RIP hundreds or thousands of copies to prevent you from stealing it.

The last problem area is transparency.  There are two types - masks and true PDF transparency - and both create performance issues.  (Most of what I will say here is more or less invisible to a user because "A" list applications try very hard to "hide" the fact that transparency is being used.)  Basically any situation in a PDF file where you can "see through" a transparent part to some underlying PDF element is a problem.  For transparency masks, which can occur in any PDF, the problem increases as the size of the mask increases.  For true PDF transparency (controlled by the graphics state) any use is a problem for performance.

The issue is simple - a printer cannot render "transparent" ink.  So, if a blue line crosses over a black line and we want to see some of the black line through the blue one, the RIP has to calculate what color the transparency effect would produce and print a gray-blue color to represent that effect.  The calculation requires the RIP to rasterize the transparent areas, create colors for them, then merge that rasterization result with the result of rasterizing everything else.
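For a single pixel the blend itself is simple - a rough sketch in Python of plain alpha compositing (real PDF transparency layers blend modes, groups, and color conversion on top of this, but the cost pattern is the same):

  # Simplified per-pixel blend for the blue-line-over-black-line example.
  def composite(src, dst, alpha):
      """Blend source over destination with a constant alpha (0..1), per channel."""
      return tuple(round(alpha * s + (1 - alpha) * d) for s, d in zip(src, dst))

  blue = (0, 0, 255)    # RGB for illustration; a real RIP works in device colors
  black = (0, 0, 0)

  print(composite(blue, black, 0.5))   # -> (0, 0, 128), a muted blue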

The bottom line is that transparent areas are rasterized twice - which slows things down.

So very large PDF files, as long as they avoid these and other pitfalls, will RIP very fast.  At the same time, very small files using many of these bad constructs will RIP slowly.

Tuesday, August 10, 2010

Imposition Aside...

I got involved in another PDF project the other day...  This involves imposition.  However, there were a few catches.

First off, the imposition itself was fairly simple - single sheet, front and back.  Unfortunately the imposition has to occur over a stream of front/back sheets, i.e., the underlying sheets change.  Secondly, pages to impose are chosen not by sequence but by bookmark - and bookmarks can also have an indicator which says to leave that position blank.

So basically you have a stream of documents to impose:

  DOCID_01 DOCID_03 DOCID_99 DOCID_12

And so on.  The blanks can appear like this:


  DOCID_01 blank DOCID_03 DOCID_99 DOCID_12 blank

where "blank" is a special document ID meaning skip document pages here (pages per document is constant over the run).

Other than that we're basically dealing with cell positions for each page of the document.

I used our pdfExpressXM software as a platform for this.  It supports the multi-stream imposition (though I had to fix a bug relating to relocating resources).  Internally the imposition model looks like this:

    item2={source2=+1,clip={0,0,792,612},ctm=    { 0, -1, 1, 0, 41.04,     1431 }}
    item4={source2=+3,clip={0,0,792,612},ctm=    { 0, -1, 1, 0, 693.36,    1431 }}
    item6={source2=+5,clip={0,0,792,612},ctm=    { 0, -1, 1, 0, 40.04,     1143 }}
    item8={source2=+7,clip={0,0,792,612},ctm=    { 0, -1, 1, 0, 693.364,   1143 }}
    item10={source2=+9,clip={0,0,792,612},ctm=   { 0, -1, 1, 0, 40.04,      855 }}
    item12={source2=+11,clip={0,0,792,612},ctm=  { 0, -1, 1, 0, 693.364,    855 }}
    item14={source2=+13,clip={0,0,792,612},ctm=  { 0, -1, 1, 0, 40.04,      567 }}
    item16={source2=+15,clip={0,0,792,612},ctm=  { 0, -1, 1, 0, 693.364,    567 }}
    item18={source2=+17,clip={0,0,792,612},ctm=  { 0, -1, 1, 0, 40.04,      279 }}
    item20={source2=+19,clip={0,0,792,612},ctm=  { 0, -1, 1, 0, 693.364,    279 }}

    item1={source2=+0,clip={0,0,792,612},ctm=    { 0, 1, -1, 0, 1305.36,   1157.4}}
    item3={source2=+2,clip={0,0,792,612},ctm=    { 0, 1, -1, 0, 653.04,  1157.4}}
    item5={source2=+4,clip={0,0,792,612},ctm=    { 0, 1, -1, 0, 1305.36,    869.4}}
    item7={source2=+6,clip={0,0,792,612},ctm=    { 0, 1, -1, 0, 653.04,   869.4}}
    item9={source2=+8,clip={0,0,792,612},ctm=    { 0, 1, -1, 0, 1305.36,    581.4}}
    item11={source2=+10,clip={0,0,792,612},ctm=  { 0, 1, -1, 0, 653.04,   581.4}}
    item13={source2=+12,clip={0,0,792,612},ctm=  { 0, 1, -1, 0, 1305.36,    293.4}}
    item15={source2=+14,clip={0,0,792,612},ctm=  { 0, 1, -1, 0, 653.04,   293.4}}
    item17={source2=+16,clip={0,0,792,612},ctm=  { 0, 1, -1, 0, 1305.36,      5.4}}
    item19={source2=+18,clip={0,0,792,612},ctm=  { 0, 1, -1, 0, 653.04,     5.4}}

    item21={source1=+0,clip={0,0,1440,1345.68},ctm={1,0,0,1,0,0}}
    item22={source1=+1,clip={0,0,1440,1345.68},ctm={1,0,0,1,0,0}}


There are two "sources" defined source1 and source2.  source1 is the stream of backgrounds and source2 is the stream of items to impose.  Each itemn entry defines a location where an imposed item is placed.  A clip and CTM (transformation matrix) is also supplied. 

There is also a way to specify the cycle for each input stream - cycle being the number of pages to step each time (items are offset, e.g.,  "source1=+2", from the current page in the cycle).

The interesting part of this is that the stream of imposed pages is "virtual", i.e., defined by the document id stream I described above.

The internal architecture decodes the document IDs into a stream of pages.  The XM architecture is defined such that page numbers that are out of range may occur in the input page stream.  When such an occurrence is found a blank is produced.  So the document ID stream gets converted to something like:

  5 6 99999999 99999999 11 12 1 2

Where 99999999 basically causes a blank page to be produced.
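A rough Python sketch of that expansion step (the bookmark index and pages-per-document value here are hypothetical; the sentinel is just an out-of-range page number as described above):

  # Sketch of turning a document-ID stream into a page stream, assuming a fixed
  # number of pages per document and a bookmark index mapping each ID to its
  # first page.  "blank" becomes an out-of-range sentinel so blanks get produced.
  BLANK_SENTINEL = 99999999
  PAGES_PER_DOC = 2

  def expand(doc_ids, first_page_of):
      pages = []
      for doc_id in doc_ids:
          if doc_id == "blank":
              pages.extend([BLANK_SENTINEL] * PAGES_PER_DOC)
          else:
              start = first_page_of[doc_id]               # from the bookmark index
              pages.extend(range(start, start + PAGES_PER_DOC))
      return pages

  # Hypothetical bookmark index for a stream like the one shown above.
  index = {"DOCID_01": 1, "DOCID_03": 5, "DOCID_99": 11, "DOCID_12": 13}
  print(expand(["DOCID_03", "blank", "DOCID_12", "DOCID_01"], index))
  # -> [5, 6, 99999999, 99999999, 13, 14, 1, 2]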

Wednesday, August 4, 2010

Looking ahead...

So as I complete the posting over in Lone Wolf on "Industrial Color Management" I am starting to think about what to cover here.

Basically a less capable version of our color management system has been in real-world production for half a year or so.

What I am interested in now is finding others who have needs in this area - specifically PDF color transformations. 

At the end of the day color is just another data driven variable.

But I want to be clear - we are not parameterizing the creation of PDF with color - like a data driven Illustrator or something - we are parameterizing the alteration of existing PDF (or TIFF or JPG or AFP).

Anyway I hope to complete the Lone Wolf discussion within a week or two.