Tuesday, July 27, 2010

New Color Management Technology to be Released...

I think this will be of great interest to those following this thread...

follow it here...

The core technology will be discussed on Lone Wolf as it is not PDF specific.

Processing Color...

Given all this complex machinery for manipulating a PDF - what do we do in the second pass that is interesting?

For one thing we can add PDF resources, like new color spaces, that were not previously part of the PDF file. So, for example, if I wanted to change /DeviceRGB to a color space with an ICC profile I would have to add the new color space and profile into the Resource dictionary.
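
To make this concrete, here is a rough sketch (plain Python with made-up names, not our actual code) of the kind of bookkeeping involved: the page's Resources dictionary gains a /ColorSpace entry pointing at an ICCBased color space so that rewritten content can refer to it by name.

    # Illustrative only: a page's /Resources modeled as plain Python dicts.
    # In a real PDF these are Cos dictionaries and indirect object references.
    page_resources = {
        "Font": {"F1": "<ref to font object>"},
        # no /ColorSpace entry yet
    }

    def add_icc_colorspace(resources, name, icc_profile_ref, n_components=4):
        """Register an /ICCBased color space under 'name' so content can select it."""
        colorspaces = resources.setdefault("ColorSpace", {})
        # An ICCBased space is an array [/ICCBased <stream>], where the stream
        # carries the profile data and an /N component count.
        colorspaces[name] = ["ICCBased", {"N": n_components, "stream": icc_profile_ref}]
        return name

    # e.g. register a CMYK press profile under a name the rewritten operators will use
    add_icc_colorspace(page_resources, "CS_CMYK_0001", "<ref to ICC profile stream>")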

Another thing that happens is that resources are renamed to be compatible as described in the previous post. Because we rename them uniquely we can easily associate subsequent processing (described below) with the correct resource.

A big part of the second pass is something called "scan". Scan proceeds to process each and every PDF operator on a PDF page. It starts at the first object and proceeds to the end. At a minimum it must rename references to resources to use the new portable names. We call this "replace". In general we have machinery that can arbitrarily match some PDF operator and its parameters and replace them with other operators or parameters (or remove them entirely).
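
A stripped-down sketch of the "scan" and "replace" idea, again in illustrative Python rather than the real engine, here just renaming resource references inside a content stream (the new names are purely made up):

    import re

    # Map of original resource names to the new, guaranteed-unique names
    # assigned during resource processing.
    rename_map = {"/F1": "/LXF_0001", "/Im0": "/LXIM_0007"}

    def scan_and_replace(content_stream: str, renames: dict) -> str:
        """Walk the content stream token by token and rewrite name references."""
        out = []
        for token in re.split(r"(\s+)", content_stream):  # keep the whitespace tokens
            out.append(renames.get(token, token))
        return "".join(out)

    before = "BT /F1 12 Tf (Hello) Tj ET\n/Im0 Do"
    after = scan_and_replace(before, rename_map)
    # after == "BT /LXF_0001 12 Tf (Hello) Tj ET\n/LXIM_0007 Do"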

On top of this machinery we added color processing. Color processing involves replacing one color with another (more on this in future posts). So as we process the PDF operators we may encounter a simple color operator, e.g., "0 0 0 rg". Again, this is basically (at least on the simplest level) a "scan" and "replace" operation; but in the case of color an elaborate color engine has been added to handle defining matches and replacements. The color machinery is able to look up the combination "0 0 0" and "rg" and come up with a replacement. A simple replacement might be "0 0 0 1 k".
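
The lookup itself can be pictured as a table keyed on the operand/operator pair. A toy sketch in Python (the real color engine is far more general and rule-driven than a fixed table):

    # Toy color replacement table: (operands, operator) -> (operands, operator).
    # "0 0 0 rg" (RGB black fill) becomes "0 0 0 1 k" (CMYK black fill).
    color_map = {
        (("0", "0", "0"), "rg"): (("0", "0", "0", "1"), "k"),
        (("0", "0", "0"), "RG"): (("0", "0", "0", "1"), "K"),  # stroking variant
    }

    def replace_color(operands, operator):
        """Return the replacement operands/operator, or the original if no match."""
        return color_map.get((tuple(operands), operator), (tuple(operands), operator))

    ops, op = replace_color(["0", "0", "0"], "rg")
    # ops == ("0", "0", "0", "1"), op == "k"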

In the case of resources there may be images involved which also have associated color spaces. Again the same sort of processing occurs - but it happens at the time of resource processing because the color handling for images is usually part of the image. Images typically involve RGB to CMYK translations for most of our customers.
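
For image samples, the simplest possible RGB to CMYK conversion is just the textbook formula with 100% black generation. Real conversions go through profiles and black generation curves, so treat this only as a sketch of the idea:

    def rgb_to_cmyk(r, g, b):
        """Textbook RGB -> CMYK conversion with full black generation.
        r, g, b are in the range 0.0 to 1.0."""
        k = 1.0 - max(r, g, b)
        if k >= 1.0:                        # pure black pixel
            return (0.0, 0.0, 0.0, 1.0)
        c = (1.0 - r - k) / (1.0 - k)
        m = (1.0 - g - k) / (1.0 - k)
        y = (1.0 - b - k) / (1.0 - k)
        return (c, m, y, k)

    rgb_to_cmyk(0.0, 0.0, 0.0)   # -> (0.0, 0.0, 0.0, 1.0)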

Sunday, July 25, 2010

So our solution for this customer is based on a product we call pdfExpress XM. This product is like our basic pdfExpress product but it's designed to handle these large files. We created it because the customer called up a few years ago with a problem related to an inkjet printer they had. Seems the printer had performance problems RIPing files but the problems went away when the PDFs went through our application.

Only problem was that the files were larger, page-count wise, than we could typically handle and our application ran out of memory. Since this was a "rush" I had to come up with a scheme to make very large files work.

The idea I came up with is a PDF compiler that prepares the PDF files for a second pass in such a way as to ensure that very little memory is needed. Our original software was designed with the basic strategy of consuming the entire PDF file before it starts to process anything. While this was a reasonable strategy in 2001 it's no longer valid.

Basically when we inspect the entire PDF file we have to compute the set of resources used by each page so that if the page is used in an output file we know what resources to include. We also have to process the contents of the file to see what resources are used by the page. So, for example, virtually every PDF file has a Font resource called "F1". So if I wanted to place two PDF pages from two different files (or even from the same file, because there is no guarantee that the font "F1" on page one is the same font called "F1" on the second page) on the same page as part of an imposition, there are only two ways to do it. The first is to use a PDF form object - which is kind of like a PDF subroutine. It "hides" the resource issue by creating a new scope for the F1 on each page.

While doing the form thing in PDF is fairly easy, it doesn't require you to have full control over the page content in the sense of knowing every resource used, i.e., you can embed the page in a form without having to deal with the content. Since the path I came from was the PDF VDP work I had created (and patented, US 6,547,831), which involved "editing" PDF content to insert variable data, I declined to follow the form route.

The second way, and the one I chose, is renaming: I inspect all the named resources like "F1" and give them a guaranteed unique name relative to what I am doing with them. Long story short, all this uses a lot of memory - especially editing PDF content streams - so I created a means to "compile" most of this processing in one pass and have a second pass that uses the predigested resource data to do the real work.
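
Roughly what that first, "compile", pass has to figure out, sketched with entirely made-up naming: for each page, which resources its content actually references, and a unique replacement name for each so pages pulled from different files can coexist. Illustrative Python, not the actual implementation:

    import re
    from itertools import count

    _uid = count(1)

    def inventory_page(resources: dict, content_stream: str) -> dict:
        """Return a mapping of each resource name actually used on the page
        to a new name intended to be unique across every file we combine."""
        used = set(re.findall(r"/(\w+)", content_stream))
        renames = {}
        for category, entries in resources.items():     # Font, XObject, ColorSpace, ...
            for name in entries:
                if name in used:
                    renames[name] = f"LX{category[:2].upper()}_{next(_uid):04d}"
        return renames

    page_resources = {"Font": {"F1": "..."}, "XObject": {"Im0": "..."}}
    content = "BT /F1 12 Tf (Hi) Tj ET /Im0 Do"
    inventory_page(page_resources, content)
    # e.g. {'F1': 'LXFO_0001', 'Im0': 'LXXO_0002'}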

The "compiled" PDF is still a PDF. We just add metadata to the Cos object tree for our second pass. If you're not our second pass no problem - it just looks like a PDF. If we find our metadata we use it instead - saving tons of memory and allowing us to handle really large files.

Friday, July 23, 2010

Really high page count PDF files

Today it's surprising that almost no one, professionals included, understands the issues with PDF file size. Most people believe that "the larger the file size in bytes the slower the file will RIP".

Nothing could be further from the truth. RIP time is almost always proportional to the complexity of the pages as well as the total number of pages. These issues interact and are non-linear. Basically there are things that can happen on a page which make the page RIP very slowly and there are things you can do on a page that consume limited RIP resources that, as the page count increases beyond a few thousand, choke the RIP's performance.

In designing my PDF library it was very important for me to avoid these usual pitfalls. Since I solve RIP performance issues professionally I did not want to create more problems for myself by creating a library that compounded the problems.

(This is one reason I am an outcast to some degree. I did try to purchase a PDF library from a third party but it never worked right - I quickly abandoned it and wrote my own.)

PDF is first and foremost just a structured file format. While interpreting the PDF file causes an image to appear, there is no reason that the file content cannot be treated as data. In order to address performance issues I try to look at the file format as purely data - it's irrelevant what the "data" does when, for example, merging PDF files or pages. What is important is that you respect what the data does.

To this day people do not understand this concept.

Initially my products started out as Acrobat Plug-ins. After struggling with performance issues customers kept bringing up I tried the aforementioned library. When that failed I created my own.

The critical item with PDF is to understand that it is like compiler output prior to linking. The compiler does not know where in memory your program will be located so it writes out a generic description of the program which the linker can easily relocate. pdfExpress does the same thing - while I could have used Adobe PDF forms to accomplish much of what I needed at the time - it would have limited my options later. Instead, what I do when combining PDF files is 'relocate' the PDF structures. So pdfExpress (and the whole family of products) is kind of like a linker - each PDF file is made independent of the others so that it can be combined with as many other PDFs as needed.
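
The linker analogy can be made concrete. When two files are combined, every indirect object number in one of them has to be rebased past the other's, much like a linker relocating addresses. A minimal sketch, assuming objects are modeled as simple Python records rather than real Cos objects:

    def relocate(objects: dict, offset: int) -> dict:
        """Shift every object number, and every reference to one, by a fixed
        offset so one file's objects can live alongside another's without
        numbering collisions."""
        def fix(value):
            if isinstance(value, dict) and "ref" in value:   # an indirect reference
                return {"ref": value["ref"] + offset}
            if isinstance(value, dict):
                return {k: fix(v) for k, v in value.items()}
            if isinstance(value, list):
                return [fix(v) for v in value]
            return value

        return {num + offset: fix(obj) for num, obj in objects.items()}

    # File B's objects get rebased past file A's highest object number,
    # after which the two object trees can simply be concatenated.
    file_a = {1: {"Type": "Page", "Contents": {"ref": 2}}, 2: {"stream": "..."}}
    file_b = {1: {"Type": "Page", "Contents": {"ref": 2}}, 2: {"stream": "..."}}
    merged = {**file_a, **relocate(file_b, offset=max(file_a))}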

While this was a lot of work initially the product family has provided superior performance and customer results for almost a decade. No one else, as far as I know, has ever figured this out.

So why is all this important to the "customer discussion" I started in the last post?

Because my notion of "relocation" with PDF has given me two things I need for 2010 - one is speed - I can process pages about 1,000 times faster than any typical PDF library and I can process PDF files with 1,000 times more pages. So for large page count PDFs where typical products fail - I win.

The second notion is that I have to look at each PDF construct on every page (the PDF operators) to do the "relocation". At the time this was way more work than anyone else did to process PDF. But today, when I need to change the color in a PDF file - well, I am already touching everything in the file that has to do with color because I am "converting" it for relocation. So it was no big deal to change "convert for relocation" to "change for color".

Thursday, July 22, 2010

A customer identifies a problem...

I have a customer that receives tens of thousands of transpromo PDF pages a day for printing (mostly statements, bills, notices and the like filled with ads and other colorful elements). These pages come in groups from various sources along with metadata for mail sort processing.

The problem is that the color space is different in each file.

By this I mean that for one set of files someone was shown a set of proofs on an unknown paper in unknown lighting conditions using values for CMYK or RGB that "looked good". This same process with different paper, lighting, etc. was done for all the sets of PDF pages.

The problem I was asked to solve is "How do we print these jobs out every day and make sure the color is acceptable to the customer?"

There are some considerations that make this problem easier than it might be. First off the color issues are relatively predictable, i.e., a logo has bad color or the green bar across the page is wrong, that sort of thing - we are not talking about pages of full color images. There are images on the pages - but mostly as parts of ads.

We began work on this project about 3 or 4 years ago. At the time the requirements for "fixing" the color were simple - one logo or a color here or there.

Of course the usual "use ICC profiles" was mentioned but that was a miserable failure for several reasons: First the work force is untrained in color and has a primarily black and white mailing background. Second the source material was out of their control. Third there was no concept of monitoring the color during production.

Another requirement was that the files could have up to 100,000 pages so PDF utilities for "fixing color" did not work.

Wednesday, July 21, 2010

What is the "PDF Outsider"

Since blog sites don't have nice neat ways to create categories I have created this blog to track my thoughts about PDF.

PDF is an Adobe, Inc. file format created in the 1990s to replace physical paper.

Overall they've done quite well...

For the last 12 years I have created products that process PDF files in ways that are completely different than those found in most products. These products have been very successful over the last decade.

People are interested in what I have done so I have created this blog for talking about it. It is linked from The "Lone Wolf" Graphic Arts Technologist as a PDF subcategory.

So why am I the PDF Outsider?

A couple of reasons:

First - I wrote my own PDF library (software for manipulating PDF files) based on a premise counter to every other PDF library in existence. It treats PDF as data instead of something to rasterise (more on this later...)

Second - I don't believe in the "standard PDF dogma".

Third - I have proven my beliefs valid by my own success.

This blog will be technical, complex and hard to follow. Most who read it will probably not understand.

I created my company Lexigraph, Inc. to sell my creations...

As a side note it is assumed that you are familiar with PDF, color for commercial print, workflows, and programming at a minimum. If not it will be rough going...

This blog is Copyright (C) Todd R. Kueny, Sr.