Friday, July 23, 2010

Really high page count PDF files

Today it's surprising that almost no one, professionals included, understands the issues with PDF file size. Most people believe that "the larger the file size in bytes, the slower the file will RIP".

Nothing could be further from the truth. RIP time is almost always proportional to the complexity of the pages and to the total number of pages, and these factors interact non-linearly. Basically, there are things that can happen on a page that make the page RIP very slowly, and there are things you can do on a page that consume limited RIP resources and, as the page count grows beyond a few thousand, choke the RIP's performance.

In designing my PDF library it was very important for me to avoid these usual pitfalls. Since I solve RIP performance issues professionally, I did not want to create more problems for myself with a library that compounded them.

(This is one reason I am an outcast to some degree. I did try to purchase a PDF library from a third party, but it never worked right - I quickly abandoned it and wrote my own.)

PDF is first and foremost just a structured file format. While interpreting the PDF file causes an image to appear, there is no reason the file content cannot be treated as data. To address performance issues I try to look at the file format as purely data - it's irrelevant what the "data" does when, for example, merging PDF files or pages. What is important is that you respect what the data does.

To this day people do not understand this concept.

My products started out as Acrobat plug-ins. After struggling with the performance issues customers kept bringing up, I tried the aforementioned third-party library. When that failed, I created my own.

The critical thing to understand about PDF is that it is like compiler output prior to linking. The compiler does not know where in memory your program will be located, so it writes out a generic description of the program that the linker can easily relocate. pdfExpress does the same thing. While I could have used Adobe PDF forms to accomplish much of what I needed at the time, that would have limited my options later. Instead, when combining PDF files, I 'relocate' the PDF structures. So pdfExpress (and the whole family of products) is kind of like a linker - each PDF file is made independent of the others so that it can be combined with as many other PDFs as needed.
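The relocation idea can be sketched in miniature. This is a hypothetical illustration, not pdfExpress code: a PDF is reduced to a table of numbered indirect objects, cross-references are modeled as `Ref(n)` (standing in for PDF's real `n 0 R` syntax), and merging rebases the second document's object numbers past the first's, just as a linker rebases addresses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Ref:
    num: int  # the indirect object number being referenced


def relocate(objects, offset):
    """Shift every object number and every Ref inside it by `offset`."""
    def shift(value):
        if isinstance(value, Ref):
            return Ref(value.num + offset)
        if isinstance(value, list):
            return [shift(v) for v in value]
        if isinstance(value, dict):
            return {k: shift(v) for k, v in value.items()}
        return value
    return {num + offset: shift(body) for num, body in objects.items()}


def merge(a, b):
    """Combine two object tables; b's objects are rebased past a's."""
    offset = max(a) if a else 0
    merged = dict(a)
    merged.update(relocate(b, offset))
    return merged


doc_a = {1: {"Type": "Page", "Contents": Ref(2)}, 2: "...stream A..."}
doc_b = {1: {"Type": "Page", "Contents": Ref(2)}, 2: "...stream B..."}
merged = merge(doc_a, doc_b)
# doc_b's page becomes object 3 and its Contents reference now points at 4
```

Because every reference in each file is rewritten consistently, the rebased document is fully independent of its original object numbering, which is what lets any number of files be combined without collisions.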

While this was a lot of work initially, the product family has provided superior performance and customer results for almost a decade. No one else, as far as I know, has ever figured this out.

So why is all this important to the "customer discussion" I started in the last post?

Because my notion of "relocation" with PDF has given me two things I need for 2010. One is speed - I can process pages about 1,000 times faster than a typical PDF library, and I can process PDF files with 1,000 times more pages. So for large page count PDFs, where typical products fail, I win.

The second is that I already have to look at every PDF construct on every page (the PDF operators) to do the "relocation". At the time this was far more work than anyone else did to process PDF. But today, when I need to change the color in a PDF file - why, I am already touching everything in the file that has to do with color while "converting" it for relocation. So it was no big deal to turn "convert for relocation" into "change for color".
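As a toy illustration of what touching every operator buys you (hypothetical code, not the actual product): once you are already tokenizing a page's content stream to convert it for relocation, rewriting the operands of the RGB color-setting operators (`rg` for fill, `RG` for stroke, in real PDF syntax) is a small incremental step. Real content streams are compressed byte streams; this sketch operates on a decoded, simplified text form.

```python
def recolor(content, new_rgb):
    """Replace the operands of every rg/RG color operator with new_rgb."""
    out = []
    for tok in content.split():
        if tok in ("rg", "RG") and len(out) >= 3:
            # the three tokens preceding the operator are its r g b operands
            out[-3:] = [f"{c:g}" for c in new_rgb]
        out.append(tok)
    return " ".join(out)


# a red-filled rectangle becomes blue
print(recolor("1 0 0 rg 10 10 100 50 re f", (0, 0, 1)))
```

A complete tool would also handle the other color operators (`g`, `k`, `sc`, and friends) and colors set indirectly through resources, but the shape of the work is the same single pass over the operators.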
