Sunday, July 25, 2010

So our solution for this customer is based on a product we call pdfExpress XM. This product is like our basic pdfExpress product, but it's designed to handle these large files. We created it because the customer called up a few years ago with a problem related to an inkjet printer they had. It seems the printer had performance problems RIPing files, but the problems went away when the PDFs went through our application.

The only problem was that the files were larger, page-count wise, than we could typically handle, and our application ran out of memory. Since this was a "rush" I had to come up with a scheme to make very large files work.

The idea I came up with is a PDF compiler that prepares the PDF files for a second pass in such a way as to ensure that very little memory is needed. Our original software was designed with the basic strategy of consuming the entire PDF file before it starts to process anything. While this was a reasonable strategy in 2001, it's no longer valid.

Basically, when we inspect the entire PDF file we have to compute the set of resources used by each page, so that if the page is used in an output file we know what resources to include. We also have to process the contents of the file to see what resources are used by the page. So, for example, virtually every PDF file has a Font resource called "F1". Suppose I want to place two PDF pages on the same page as part of an imposition. The pages might come from two different files, or even from the same file, because there is no guarantee that the font "F1" on page one is the same font called "F1" on the second page. There are only two ways to do it. The first is to use a PDF form object, which is kind of like a PDF subroutine. It "hides" the resource issue by creating a new scope for the F1 on each page.
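To make the scoping idea concrete, here is a toy sketch in Python. It is not real PDF parsing and not how our product works: pages and forms are modeled as plain dictionaries, content is a list of simplified operators, and the function name `resolve_fonts` is made up for illustration. It only shows that each form looks up "F1" in its own resource dictionary, so two forms can both use the name without clashing.

```python
def resolve_fonts(container):
    """Return the font objects used, in drawing order, resolving each
    name in the resource dictionary of the container that uses it."""
    resolved = []
    for item in container["content"]:
        if item["op"] == "Tf":
            # A font name is looked up in this container's own resources.
            resolved.append(container["resources"]["Font"][item["name"]])
        elif item["op"] == "Do":
            # Drawing a form switches to the form's resources (a new scope).
            form = container["resources"]["XObject"][item["name"]]
            resolved.extend(resolve_fonts(form))
    return resolved


# Two source pages wrapped as forms; both call their font "F1".
form_a = {"resources": {"Font": {"F1": "Helvetica"}},
          "content": [{"op": "Tf", "name": "F1"}]}
form_b = {"resources": {"Font": {"F1": "Times-Roman"}},
          "content": [{"op": "Tf", "name": "F1"}]}

# The imposed sheet only knows the two forms, not their fonts.
sheet = {"resources": {"XObject": {"Fm1": form_a, "Fm2": form_b}},
         "content": [{"op": "Do", "name": "Fm1"},
                     {"op": "Do", "name": "Fm2"}]}

print(resolve_fonts(sheet))
```

The sheet never has to know what is inside either form, which is exactly why this route is easy.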

Doing the form thing in PDF is fairly easy, and it doesn't require you to have full control over the page content in the sense of knowing every resource used, i.e., you can embed the page in a form without having to deal with the content. But the path I came from was the PDF VDP technology I had created (and patented, US 6,547,831), which involved "editing" PDF content to insert variable data, so I declined to follow the form route.

So instead I inspect all the named resources like "F1" and give them a guaranteed unique name relative to what I am doing with them. Long story short, all this uses a lot of memory, especially editing PDF content streams, so I created a means to "compile" most of this processing in one pass and have a second pass that uses the predigested resource data to do the real work.
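As a rough illustration of the renaming idea, here is a minimal Python sketch. It is a simplification and not our actual code: a page is modeled as a dictionary holding a resource table and a content string, the function name `uniquify_page` and the prefix scheme are invented for the example, and the regular expression rewrites every name token without skipping string literals or inline images the way a real content-stream parser must.

```python
import re

# Matches a PDF-style name token such as /F1 (simplified character set).
NAME_TOKEN = re.compile(r"/([A-Za-z0-9_.\-]+)")


def uniquify_page(page, prefix):
    """Return a copy of `page` whose resource names carry `prefix`,
    with the content rewritten to use the new names."""
    mapping = {}
    new_resources = {}
    for category, entries in page["resources"].items():
        new_resources[category] = {}
        for name, obj in entries.items():
            new_name = f"{prefix}_{name}"
            mapping[name] = new_name
            new_resources[category][new_name] = obj

    def rename(match):
        # Names that are not in the resource table are left alone.
        name = match.group(1)
        return "/" + mapping.get(name, name)

    return {"resources": new_resources,
            "content": NAME_TOKEN.sub(rename, page["content"])}


page_a = {"resources": {"Font": {"F1": "Helvetica"}},
          "content": "BT /F1 12 Tf (Hello) Tj ET"}
page_b = {"resources": {"Font": {"F1": "Times-Roman"}},
          "content": "BT /F1 10 Tf (World) Tj ET"}

a = uniquify_page(page_a, "P1")
b = uniquify_page(page_b, "P2")

# Both pages can now share one resource dictionary on the imposed sheet.
merged_fonts = {**a["resources"]["Font"], **b["resources"]["Font"]}
print(merged_fonts)
print(a["content"])
```

Note that the content has to be rewritten along with the resource table, which is the part that costs memory when the files get large.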

The "compiled" PDF is still a PDF. We just add metadata to the Cos object tree for our second pass. If you're not our second pass, no problem: it just looks like a PDF. If we find our metadata we use it instead of redoing the analysis, saving tons of memory and allowing us to handle really large files.
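The two-pass arrangement can be sketched roughly like this in Python. Everything here is assumed for illustration: the document is a plain dictionary rather than a real Cos tree, the key name `XM_PageResources` is hypothetical and not what the product writes, and the "analysis" is just a regular-expression scan for name tokens.

```python
import re

# Hypothetical private key; the real metadata layout is not shown here.
PRIVATE_KEY = "XM_PageResources"


def scan_resources(content):
    """The expensive step: find every resource name the content uses."""
    return sorted(set(re.findall(r"/([A-Za-z0-9_.\-]+)", content)))


def compile_pass(doc):
    """First pass: store the per-page result next to each page."""
    for page in doc["pages"]:
        page[PRIVATE_KEY] = scan_resources(page["content"])
    return doc


def resources_for(page, stats):
    """Second pass: use the stored result if present, else scan."""
    if PRIVATE_KEY in page:
        stats["precomputed"] += 1
        return page[PRIVATE_KEY]
    stats["scanned"] += 1
    return scan_resources(page["content"])


doc = {"pages": [{"content": "BT /F1 12 Tf (Hi) Tj ET /Im0 Do"},
                 {"content": "BT /F2 9 Tf (Bye) Tj ET"}]}

# An ordinary PDF: the second pass has to scan every page itself.
plain_stats = {"precomputed": 0, "scanned": 0}
plain = [resources_for(p, plain_stats) for p in doc["pages"]]

# A "compiled" PDF: the second pass just reads what the first pass left.
compile_pass(doc)
compiled_stats = {"precomputed": 0, "scanned": 0}
compiled = [resources_for(p, compiled_stats) for p in doc["pages"]]

print(plain == compiled, plain_stats, compiled_stats)
```

A reader that doesn't know the private key simply ignores it, so the compiled file still behaves like any other PDF.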
