Friday, October 15, 2010

Where's the PDF Development?

Though Adobe has passed PDF off into the public domain, there are still active pockets of PDF development going on around the world.

Google's Chrome seems to have some active PDF development going on, though the focus appears to be on making Chrome use a non-Adobe plug-in for PDF rendering.

There is also "PDF Quick View" when you do a Google search that turns up a PDF (probably in gmail and other Google products as well - but I don't use those...).  Its covered in this blog.  This seems to work well over all - its generally a lot nicer than fooling around with the Adobe plug-in to see something.

The sad part of all this, though, is that Google is too stupid to understand the full picture of PDF, the one that includes print.  (See this from Lone Wolf.)  They only care about PDFs that fit into their model.  A little digging will demonstrate that they don't fully support PDF rendering yet.  I suppose they are working on it and one day will.  But as we have seen they don't get color or many of the other things that those of us in the printing world know, love, and, most importantly, NEED from PDF files.

Google is the king of ad placement - though I don't know if Adobe's exit from "PDF Ads" is good or bad in that regard.

More PDF activity is going on around Silverlight, PDF editors, and other tools.

There is also the ubiquitous http://www.planetpdf.com/ - though the forums and other resources there have not seen much activity for years.

My fear is that without any sort of leadership PDF will become fully bastardized over the next couple of years.  By that I mean that no one will step in to fill the leadership vacuum left by Adobe's exit from that position.

Each company with its own ax to grind will pick up the parts they care about and ignore the rest.  Companies like Google will tear PDF apart faster than the rest because of their ubiquity.

Thursday, October 7, 2010

PDF - Technology to Live Without?

I found an interesting post here.  Basically, from an enterprise perspective PDF is a no-no - at least on the web - and it comes in at #3 on the list.  I have seen this in the real world - many corporate types cannot receive PDFs in email because they are blocked by the corporate firewall.  My belief is that IT types don't like PDF for a couple of reasons.

First, though it is relatively secure, there are some clumsy problems like JavaScript that make it seem like a risk.
 
There are various hack schemes associated with PDF, JavaScript hacking chief among them from what I can see.  This basically involves some mechanism to run, or to get you to run, a nefarious script that has either been embedded in the PDF or is somehow linked to it via web browsing.

Adobe offers fixes for the elements that involve using the PDF to display a dialog that tricks the user into running a malicious app from the PDF, as documented here.  This is all linked together via the Zeus botnet.

Second, the machinery of PDF is opaque to IT types.  This is kind of an interesting point.  I tracked down a Black Hat document on PDF threats (itself a PDF!)  Eric Filiol, the author, is the Head Scientist Officer of the Virology and Cryptology Laboratory at the French Army Signals Academy.

Basically this document outlines some of the attacks I describe above as well as covers some PDF basics.

What is of interest to me is that it's relatively shallow in what it covers.  PDFs are relatively complex files and there are quite a few malicious holes in them.  But this analysis stops short of doing much more than a superficial inspection.

They do cover the various Forms actions you can associate with elements of a PDF, and they also cover some material about registry settings and what they can allow or disallow in terms of security.

I suspect the reason for this is that to process the guts of a PDF you need some relatively sophisticated technology.  The paper describes PDFStructAzer, a tool they wrote to monkey with PDF files for hacking purposes.

I sent this guy an email offering to discuss PDF with him - but so far I have not received a response.

Third, and probably most importantly, the Adobe Acrobat and Flash worlds are relatively closed.  What I mean by this is that on the IT side of the world there is a lot of activity and interaction between developers and the corporate folks - back and forth on the Microsoft side over formats, developer kits, and so on.  IT folks don't like closed because it makes their jobs harder to do.

Silverlight, for example, is kind of a Flash/PDF replacement for web use.  This went through a long beta period with lots of user input from developers.

Try that with an Adobe product.

From the AFP perspective there is much to learn here.  AFP is much less complex security-wise than PDF so I doubt you will have nearly the issues coming from that side of things.

Thursday, August 19, 2010

How do you know your PDF is correct? (cont.)

I guess over time I have come to adopt the PDF correctness model that says "if the PDF is generally correct, i.e., it works in a variety of applications across versions, operating systems, and processors, and it works correctly for a given application, then we say the PDF is in fact correct."

There are a few PDF validating tools out there you can buy.  Some large printing houses have them to validate their internal workflow or external PDF inputs (like ads in a newspaper).  But these are of little use if the PDF that comes out doesn't work or if you don't have the money to buy one.

We have used this type of model for more than a decade and it has always worked well.

Periodically we come across new vendors or vendors updating their PDF output and we run into interesting problems.  Recently a customer came to me quite panicked and said "some of my output is missing".

This is the second cardinal sin of workflow - having an application make some output that is supposed to be on the page go away.  (The first cardinal sin is having the wrong output on the page.)

So I took a look at the PDF.  All PDFs have an "Info" section containing strings like the authoring application's name, a time stamp, etc.  Well, lo and behold, way down deep in the resources for the page in question there was a color space definition.  (A color space definition says things like "CS1" means to use calibrated RGB for all colors marked by the CS1 tag.)  Well, here was the name "CS1" and its definition, instead of a legal PDF color space, was a link to the Info section of the PDF document.

Our application, in this case pdfExpress XM, iterates through the color spaces because it might want to change them.  As it does so it checks to see if each color space is one it cares about.  Part of this checking is to verify whether or not the color space conforms to the PDF standard.  When it found this particular color space definition it generated an error.  Unfortunately, several levels up in the code we made the assumption that the return value was always correct and we placed the page contents based on this.  When this error occurred we branched around the code that injected content into the page, and so that part of the page was blank.

So in this particular case it's easy to say "the PDF is wrong" - which it is.  But our handling of the bad PDF was also a problem and the customer was unhappy.


The real question here is: as a PDF application developer, how do you anticipate arbitrarily wrong PDF input?

The portion of the PDF in question is basically a dictionary (as we discussed a few days ago) where the entry in the dictionary for the value is incorrect.  This is many levels down in a structure which is otherwise correct.  While it's fairly easy to check for what should be there, handling the cases where something that's not supposed to be there shows up anyway is much harder.  For example, most places in PDF, including dictionary entries, can be indirect.  This means that instead of an actual value being present there is a pointer to some object.  Normally you have to locate this object and then examine it as if it were the entry.
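To make that concrete, here is a minimal sketch of the kind of defensive lookup involved.  It assumes the PDF has already been parsed into plain Python dictionaries and that indirect references are represented by a hypothetical IndirectRef wrapper - none of these names come from pdfExpress or any real library, it's just the shape of the idea:

    # Hypothetical, simplified model: a parsed PDF object is either a plain
    # Python value, a dict, a list, or an IndirectRef pointing into the
    # file's object table.  None of these names come from a real library.

    VALID_COLORSPACE_FAMILIES = {
        "DeviceGray", "DeviceRGB", "DeviceCMYK",
        "CalGray", "CalRGB", "Lab", "ICCBased",
        "Indexed", "Separation", "DeviceN", "Pattern",
    }

    class IndirectRef:
        def __init__(self, obj_num):
            self.obj_num = obj_num

    def resolve(value, object_table):
        """Follow an indirect reference to the actual object, if needed."""
        if isinstance(value, IndirectRef):
            return object_table.get(value.obj_num)
        return value

    def looks_like_colorspace(entry, object_table):
        """Return True only if the resolved entry is plausibly a color space.

        The bad file described above had a color space entry that resolved
        to the document Info dictionary - this check would reject it instead
        of letting the error propagate up and blank the page.
        """
        obj = resolve(entry, object_table)
        if isinstance(obj, str):                   # e.g. "DeviceRGB"
            return obj in VALID_COLORSPACE_FAMILIES
        if isinstance(obj, list) and obj:          # e.g. ["ICCBased", IndirectRef(12)]
            return obj[0] in VALID_COLORSPACE_FAMILIES
        return False                               # Info dict, None, etc. -> reject

The specific checks matter less than the fact that every caller has to treat "this entry is garbage" as a normal, expected result rather than a fatal error that blanks part of a page.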

Wednesday, August 18, 2010

How do you know your PDF is correct? (cont.)

While there are tools available to purchase in this regard, their use does not ensure that a PDF will work in practice.  Each version of PDF enables specific features, e.g., transparency, replaces specific features with alternate constructs, and so on.

So the first thing you have to figure out is which version you are trying to create.  In my experience you always want to check that the PDF is compatible with the oldest version that supports all the features you need.  You can check for correctness relative to newer versions as well but this limits the usefulness of the PDF.  Of course, "A" list companies always want to force your PDFs to the latest, most complex version - but that's not always the best for you or your customer or your application.

Once you have decided on a PDF version, the simplest way to "validate" a PDF is to use a variety of applications to process it and see if the results are correct.  For RIPs and viewers this basically means processing the PDF and checking the output and logs.  We tend to use a spectrum of newer and older tools, RIPs and applications for this.  The reasoning is that if it opens and works in older tools as well as newer tools the PDF is much more likely to be right.
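This kind of check is easy to script.  A trivial sketch - Ghostscript is only one example of a tool you might include in the spectrum, and the flags shown are just its standard batch options:

    # A trivial sketch of the "run it through several tools and check the
    # logs" approach.  Ghostscript is used here only as one example of a
    # renderer; any RIP or viewer with a batch mode works the same way.
    import subprocess
    import sys

    def check_with_ghostscript(pdf_path):
        """Render the whole file to the null device and report any errors."""
        result = subprocess.run(
            ["gs", "-dBATCH", "-dNOPAUSE", "-sDEVICE=nullpage", pdf_path],
            capture_output=True, text=True,
        )
        log = result.stdout + result.stderr
        ok = result.returncode == 0 and "Error" not in log
        return ok, log

    if __name__ == "__main__":
        ok, log = check_with_ghostscript(sys.argv[1])
        print("looks OK" if ok else "problems found:\n" + log)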

Our tools have been in operation and continuous customer use for almost a decade at this point so we only contend with new "features" for the most part.

Backward compatibility is also important and in general we tend not to add features just because we can.  Why?  Mostly because the customers who use our products have their own set of tools which they have been using for a long time and don't want to have to re-verify that any changes we have made do not negatively impact their workflow.

When you think about all of this together you start to see that there really isn't such a thing as "correct PDF" because that depends on the application and usage.  I can continually update my PDF output modules but I may break customer workflows by doing so.

Tuesday, August 17, 2010

How do you know your PDF is correct?

This question comes up from time to time.  As far as I know there is no program or application which can tell you if your PDF is in fact "correct" PDF.

Before we get into that let's look into what "correct" means in this context.  PDF is both a language and a structured file format.  Unlike PostScript, which is also a language, PDF cannot be extended.  PostScript, based on the Forth programming language, allows you to define new language constructs and make use of them in your documents.

There have been all sorts of problems with PostScript in this regard - at least from the perspective of someone trying to make use of the PostScript content of documents.  It was easy to create convoluted, buggy ways to do things and one could never be sure if the constructs created really worked right.  A big issue with this was "page independence".  Programmers used to create PostScript files that, like programs, had to be executed sequentially, i.e., page by page, in order for the PostScript programming to work right.

PDF was meant to solve this - which it did by creating a language that could not be extended.  However, it also opened another can of worms.

The structure of a PDF file is organized loosely around that of a hierarchical database.  At the top of the tree you have the "Root".  Below the root you have indexes of pages.  Pages have elements like Resources.  Resources have things like fonts and images.  Most of the important parts of the file are organized around "dictionaries".

A dictionary is a structure with keys and values:

  << /EntryA 1
     /EntryB 2
  >>

In this case there are two keys (EntryA and EntryB) and two values (1 and 2).  The keys are used to retrieve the values - so I can look up "EntryA" and get the value 1.

Using simple dictionaries doesn't present a problem.  But PDF uses dictionaries to hold other dictionaries, arrays, and other complex entities.  In addition PDF allows two dictionaries to share the same value.  So, for example, if a font is used on every page I don't have to duplicate the definition of the font; I merely store a link to a common definition in the dictionary.

So what does all this have to do with defining "correctness"?  While there are certainly many documents that define PDF, a lot that has gone on over the years has left legacy issues open, i.e., things were done a certain way early on and never changed to reflect changes to PDF.  Another big issue is that things like dictionaries in PDF have defined entries, i.e., in a Resource dictionary you have a /Font entry and that's where fonts for the page are found.  But PDF in general doesn't say anything about putting other things into these structures, e.g., I can add an /Elephants entry in the Resource dictionary and it will be ignored by most applications because they are not looking for it and don't care about it.
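To make the "ignored entries" point concrete, here is a tiny sketch using plain Python dictionaries to stand in for parsed PDF structures (a hypothetical representation, not any real library):

    # A Resource dictionary, parsed into a plain Python dict for illustration.
    # The /Elephants entry is perfectly legal - it just means nothing to anyone.
    resources = {
        "Font":      {"F1": "<font object>", "F2": "<font object>"},
        "XObject":   {"Im1": "<image object>"},
        "Elephants": {"Dumbo": "<mystery object>"},
    }

    # A typical consumer only asks for the keys it knows about...
    fonts  = resources.get("Font", {})
    images = resources.get("XObject", {})

    # ...so the unknown entry silently survives every round trip, and two
    # applications can disagree about what the "same" file contains without
    # either of them being demonstrably wrong.
    known = {"Font", "XObject", "ColorSpace", "ExtGState",
             "Pattern", "Shading", "ProcSet", "Properties"}
    print("entries nobody will ever look at:", set(resources) - known)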

(Note: This is not an issue in file formats like AFP which are not based on a structured dictionary type of model.)

So what does it mean for a PDF to be correct?  The answer is I am not sure.  We always use a fairly complex test to determine "correctness".

Thursday, August 12, 2010

Debugging Color...

I've had some interesting experiences in this regard - particularly with those unskilled in color.

First off - what do I mean by "debugging color"?  In large complex systems color failures do not occur on a screen where a message box pops up and says "sorry, your job failed".  In these systems things are put together in such a way as to work 24x7.  They process standard inputs and standard outputs - all tested and carefully organized.  So a "failure" in this context occurs only when someone actually notices the problem.

What do I mean by this?

Large systems are typically built to be, at least to a certain extent, fault tolerant in the sense that at various steps along the way checks are performed to ensure what is supposed to be going on is actually going on.  Jobs that have various incorrect elements "pop out" in fail folders and humans collect them, diagnose them, and resubmit them for processing (presumably after fixing the problems).

Color problems do not manifest themselves in quite the same way as, say, a missing database entry for a mail merge job.  In this latter case a logical integrity check says "hmmm - there is no entry for this" and the job stops processing.

If the "color is wrong" what happens?  First off - there are no systems along the way to check.  Maybe that logo is now red and not blue.  The software cannot not know this because there is nothing to check it against.  Secondly, something - a human or a machine - has to look at the result of processing the entire job - usually by looking at what's in an output bin - before the problem is identified.  So, after maybe two to four hours of processing through five different servers and ten different processes we discover something is wrong.

So the night operator, whose job it is to run the jobs nightly, notices that something doesn't look like it's supposed to.  In the production world people are trained to notice things that don't look the same as everything else.  In this case the job doesn't look like it did yesterday (or, if you're lucky, there's a sample to compare it to).  The logo is a different color or shade.

So at this point the operator probably does not know what is wrong - if anything.  The system that provides working job samples for comparison might be broken, i.e., the operator did not get an updated production sample, or the output might be wrong, or the device producing it might be working incorrectly.

Generally an operator will be able to identify problems related to things under their control, e.g., whether other jobs are producing wrong colors.  So generally the problem does not escalate unless these "standard" sorts of issues get ruled out.

So now the problem has "escalated" to the supervisor.  At this point things generally become more interesting.  Someone has to decide if the output is actually "wrong" or if the criteria used to judge it are incorrect.  As remarkable as it seems, often no one is able to make a precise diagnosis.

The reasons are very interesting.  Typically in a large shop the world of operations, job preparation, QA, graphic arts, and customer service all fall into different silos.  Each silo has its own managers, contractors, vendors, processes, and so forth.

Now the job has to "work backwards" through the system.  At each step those involved (from whatever silo is responsible) have to look at what inputs they received and compare them to the incorrect output.  In the world of color things get more interesting than in the case of, say, a data value, i.e., a bad mailing address.

Each silo has its own idea of how color is handled and measured.  Some don't really know anything about it other than how it "looks".  Others, like CSRs, tend to think about it in terms of "approved color" or "not approved color".  Programmers tend to think about the numbers in the code that produced the values.

So as the job works backwards each silo applies its own criteria to the output and decides if it's responsible for what it sees.  Generally this occurs in an ad hoc meeting on the shop floor where fingers are pointed.

(This will be continued tomorrow on the AFP Outsider blog as it applies to both PDF and AFP workflows.)

Wednesday, August 11, 2010

PDF Size and Performance...

This is a topic which comes up frequently (and no, this is not Viagra spam).

People say "This PDF is too large and it will take to long to RIP."   Basically most people make a direct link between the size of the PDF and the performance they are going to get RIPing the file.

In general this is wrong for several reasons.  First of all, until very recently PDF was always turned into PostScript before rasterizing on a RIP.  Now, PostScript is a programming language, which means that the RIP must process the language statements in order to create the raster.  All this takes time - especially when you have a lot of nested PDF forms.  So any PDF file would effectively be processed twice - once to convert the PDF to PostScript, and again to process the PostScript.

There isn't a one-to-one correspondence between PDF operators and PostScript operators, particularly in terms of complexity, so seemingly simple and short PDF might not be simple PostScript as far as the rasterizer is concerned.

PDFs can be very large due to embedded images.  The most profound effect I have seen on performance (and I mostly work on non-plating devices) is extra resolution, i.e., too many bits for the job.  Sheer volume is the first problem - a 2,400 dpi CMYK image takes a long time for the RIP to consume because there are a lot of bytes.  If you only need 600 dpi then don't put a 2,400 dpi image into the file.  RIPs process images quickly but can be overwhelmed by sheer volume.

So even though there are lots of images, the file may still RIP quickly.
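A quick way to reason about "too many bits" is to compare an image's pixel dimensions against the size at which it is actually placed on the page.  A minimal sketch - just arithmetic, no PDF library assumed:

    def effective_dpi(pixel_width, pixel_height, placed_width_pts, placed_height_pts):
        """Effective resolution of an image placed on a PDF page.

        PDF user space is 72 points per inch, so an image's effective dpi is
        its pixel count divided by the placed size in inches.
        """
        dpi_x = pixel_width  / (placed_width_pts  / 72.0)
        dpi_y = pixel_height / (placed_height_pts / 72.0)
        return dpi_x, dpi_y

    # A 6000 x 4000 pixel scan placed in a 2" x 1.33" ad slot:
    print(effective_dpi(6000, 4000, 144, 96))   # roughly (3000, 3000) dpi -
                                                # far more than a 600 dpi device can use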

Font usage is a weakness in many RIPs - particularly using a lot of fonts.  There are many cases of PDFs having a completely new set of fonts on each page in, say, a thousand page document.  RIPs don't deal well with this and for me it's been a problem over the years.  This issue compounds another common font issue - bitmap fonts.  Many programs, particularly those that convert to PDF, tend to convert source bitmap fonts to PDF bitmap fonts.  My experience is that the higher the percentage of bitmap fonts in a file, the slower, in general, it will RIP as page count goes up.

Applications go to great lengths to obfuscate fonts so that font licensors can't have their intellectual property stolen.  Unfortunately you may be paying the price for this with crappy RIP performance - so you're paying twice - once for the font if you need it, and again to RIP the hundreds or thousands of copies created to prevent you from stealing it.

The last problem area is transparency.  There are two types - masks and true PDF transparency - and both create performance issues.  (Most of what I will say here is more or less invisible to a user because "A" list applications try very hard to "hide" the fact that transparency is being used.)  Basically any situation in a PDF file where you can "see through" a transparent part to some underlying PDF element is a problem.  For transparency masks, which can occur in any PDF, the problem increases as the size of the mask increases.  For true PDF transparency (controlled by the graphic state) any use is a problem for performance.

The issue is simple - a printer cannot render "transparent" ink.  So, if a blue line crosses over a black line and we want to see some of the black line through the blue one, the RIP has to calculate what color the transparency effect would produce and print a gray-blue color to represent that effect.  The calculation requires the RIP to rasterize the transparent areas, create colors for them, then merge that rasterization result with the result of rasterizing everything else.

The bottom line is that transparent areas are rasterized twice - which slows things down.
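For what it's worth, true transparency leaves fingerprints in the page's extended graphics state dictionaries, so it can often be spotted before the file ever hits the RIP.  A rough sketch, assuming the ExtGState dictionaries have already been parsed into Python dicts (the keys ca, CA, SMask and BM are standard PDF; everything else here is illustrative):

    def uses_true_transparency(ext_gstate):
        """Heuristic check of one parsed ExtGState dictionary.

        ca, CA  - fill and stroke alpha (anything below 1.0 is transparent)
        SMask   - a soft mask (anything other than /None)
        BM      - a blend mode other than /Normal or /Compatible
        """
        if ext_gstate.get("ca", 1.0) < 1.0 or ext_gstate.get("CA", 1.0) < 1.0:
            return True
        if ext_gstate.get("SMask", "None") != "None":
            return True
        if ext_gstate.get("BM", "Normal") not in ("Normal", "Compatible"):
            return True
        return False

    # Example: a state set by an "A" list design tool for a 40% opaque overlay.
    print(uses_true_transparency({"ca": 0.4, "CA": 0.4, "BM": "Normal"}))  # True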

So very large PDF files, as long as they avoid these and other pitfalls, will RIP very fast.  At the same time, very small files using many of these bad constructs will RIP slowly.

Tuesday, August 10, 2010

Imposition Aside...

I got involved in another PDF project the other day...  This involves imposition.  However, there were a few catches.

First off, the imposition itself was fairly simple - single sheet, front and back.  Unfortunately the imposition has to occur over a stream of front/back sheets, i.e., the underlying sheets change.  Secondly, pages to impose are chosen not by sequence but by bookmark - and bookmarks can also have an indicator which says to leave that position blank.

So basically you have a stream of documents to impose:

  DOCID_01 DOCID_03 DOCID_99 DOCID_12

And so on.  The blanks can appear like this:


  DOCID_01 blank DOCID_03 DOCID_99 DOCID_12 blank

where "blank" is a special document ID meaning skip document pages here (pages per document is constant over the run).

Other than that we're basically dealing with cell positions for each page of the document.

I used our pdfExpress XM software as a platform for this.  It supports the multi-stream imposition (though I had to fix a bug relating to relocating resources).  Internally the imposition model looks like this:

    item2={source2=+1,clip={0,0,792,612},ctm=    { 0, -1, 1, 0, 41.04,     1431 }}
    item4={source2=+3,clip={0,0,792,612},ctm=    { 0, -1, 1, 0, 693.36,    1431 }}
    item6={source2=+5,clip={0,0,792,612},ctm=    { 0, -1, 1, 0, 40.04,     1143 }}
    item8={source2=+7,clip={0,0,792,612},ctm=    { 0, -1, 1, 0, 693.364,   1143 }}
    item10={source2=+9,clip={0,0,792,612},ctm=   { 0, -1, 1, 0, 40.04,      855 }}
    item12={source2=+11,clip={0,0,792,612},ctm=  { 0, -1, 1, 0, 693.364,    855 }}
    item14={source2=+13,clip={0,0,792,612},ctm=  { 0, -1, 1, 0, 40.04,      567 }}
    item16={source2=+15,clip={0,0,792,612},ctm=  { 0, -1, 1, 0, 693.364,    567 }}
    item18={source2=+17,clip={0,0,792,612},ctm=  { 0, -1, 1, 0, 40.04,      279 }}
    item20={source2=+19,clip={0,0,792,612},ctm=  { 0, -1, 1, 0, 693.364,    279 }}

    item1={source2=+0,clip={0,0,792,612},ctm=    { 0, 1, -1, 0, 1305.36,   1157.4}}
    item3={source2=+2,clip={0,0,792,612},ctm=    { 0, 1, -1, 0, 653.04,  1157.4}}
    item5={source2=+4,clip={0,0,792,612},ctm=    { 0, 1, -1, 0, 1305.36,    869.4}}
    item7={source2=+6,clip={0,0,792,612},ctm=    { 0, 1, -1, 0, 653.04,   869.4}}
    item9={source2=+8,clip={0,0,792,612},ctm=    { 0, 1, -1, 0, 1305.36,    581.4}}
    item11={source2=+10,clip={0,0,792,612},ctm=  { 0, 1, -1, 0, 653.04,   581.4}}
    item13={source2=+12,clip={0,0,792,612},ctm=  { 0, 1, -1, 0, 1305.36,    293.4}}
    item15={source2=+14,clip={0,0,792,612},ctm=  { 0, 1, -1, 0, 653.04,   293.4}}
    item17={source2=+16,clip={0,0,792,612},ctm=  { 0, 1, -1, 0, 1305.36,      5.4}}
    item19={source2=+18,clip={0,0,792,612},ctm=  { 0, 1, -1, 0, 653.04,     5.4}}

    item21={source1=+0,clip={0,0,1440,1345.68},ctm={1,0,0,1,0,0}}
    item22={source1=+1,clip={0,0,1440,1345.68},ctm={1,0,0,1,0,0}}


There are two "sources" defined source1 and source2.  source1 is the stream of backgrounds and source2 is the stream of items to impose.  Each itemn entry defines a location where an imposed item is placed.  A clip and CTM (transformation matrix) is also supplied. 

There is also a way to specify the cycle for each input stream - cycle being the number of pages to step each time (items are offset, e.g.,  "source1=+2", from the current page in the cycle).
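For anyone decoding the ctm entries above: a PDF transformation matrix { a, b, c, d, e, f } maps a point (x, y) to (a*x + c*y + e, b*x + d*y + f), so { 0, -1, 1, 0, tx, ty } is a 90 degree rotation plus a shift.  A quick sketch of checking where a page corner lands - plain arithmetic, nothing product-specific:

    def apply_ctm(ctm, x, y):
        """Apply a PDF transformation matrix (a, b, c, d, e, f) to a point."""
        a, b, c, d, e, f = ctm
        return a * x + c * y + e, b * x + d * y + f

    # The first cell above: rotate the 792 x 612 page and drop it at (41.04, 1431).
    ctm = (0, -1, 1, 0, 41.04, 1431)
    print(apply_ctm(ctm, 0, 0))      # (41.04, 1431.0) - original lower-left corner
    print(apply_ctm(ctm, 792, 612))  # (653.04, 639.0) - original upper-right corner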

The interesting part of this is that the stream of imposed pages is "virtual", i.e., defined by the document id stream I described above.

The internal architecture decodes the document IDs into a stream of pages.  The XM architecture is defined such that page numbers that are out of range may occur in the input page stream.  When such an occurrence is found a blank is produced.  So the document ID stream gets converted to something like:

  5 6 99999999 99999999 11 12 1 2

Where 99999999 basically causes a blank page to be produced.
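The conversion itself is simple enough to sketch.  Assuming every document contributes a fixed number of pages and we know each document's position in the source PDF, something like this (the names and the 99999999 sentinel are just illustrative of the description above):

    BLANK_SENTINEL = 99999999   # any page number the source file cannot contain

    def docid_stream_to_pages(id_stream, doc_positions, pages_per_doc):
        """Convert a stream of document IDs into a flat stream of page numbers.

        id_stream     - e.g. ["DOCID_01", "blank", "DOCID_03", ...]
        doc_positions - maps a document ID to its ordinal position in the source PDF
        pages_per_doc - constant page count per document for the whole run
        """
        pages = []
        for doc_id in id_stream:
            if doc_id == "blank":
                # Out-of-range page numbers cause the imposition engine to
                # produce blank cells in those positions.
                pages.extend([BLANK_SENTINEL] * pages_per_doc)
            else:
                first = doc_positions[doc_id] * pages_per_doc + 1
                pages.extend(range(first, first + pages_per_doc))
        return pages

    # Example: two-page documents; DOCID_03 is the third document in the file.
    positions = {"DOCID_01": 0, "DOCID_03": 2}
    print(docid_stream_to_pages(["DOCID_03", "blank", "DOCID_01"], positions, 2))
    # [5, 6, 99999999, 99999999, 1, 2]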

Wednesday, August 4, 2010

Looking ahead...

So as I complete the posting over in Lone Wolf on "Industrial Color Management" I am starting to think about what to cover here.

Basically a less capable version of our color management system has been in real-world production for half a year or so.

What I am interested in now is finding others who have needs in this area - specifically PDF color transformations. 

At the end of the day color is just another data driven variable.

But I want to be clear - we are not parameterizing the creation of PDF with color - like a data driven Illustrator or something - we are parameterizing the alteration of existing PDF (or TIFF or JPG or AFP).

Anyway I hope to complete the Lone Wolf discussion within a week or two.

Tuesday, July 27, 2010

New Color Management Technology to be Released...

I think this will be of great interest to those following this thread...

follow it here...

The core technology will be discussed on Lone Wolf as it is not PDF specific.

Processing Color...

Given all this complex machinery for manipulating a PDF - what do we do in the second pass that is interesting?

For one thing we can add PDF resources, like new color spaces, that were not previously part of the PDF file. So, for example, if I wanted to change /DeviceRGB to a color space with an ICC profile I would have to add the new color space and profile into the Resource dictionary.
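In the same parsed-dictionary terms used earlier, the addition is roughly this - "CS_RGB_1" and "icc_stream_ref" are just placeholder names, not anything our product actually emits:

    # Before: the page just uses the device color space implicitly.
    resources = {"Font": {"F1": "<font>"}}

    # After: register a named ICC-based color space so content can refer to it.
    # "icc_stream_ref" stands in for an indirect reference to the embedded
    # ICC profile stream; "CS_RGB_1" is whatever unique name we choose.
    resources.setdefault("ColorSpace", {})["CS_RGB_1"] = ["ICCBased", "icc_stream_ref"]
    print(resources)

    # Content that selected "/DeviceRGB cs" can now be rewritten to "/CS_RGB_1 cs".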

Another thing that happens is that resources are renamed to be compatible as described in the previous post. Because we rename them uniquely we can easily associate subsequent processing (described below) with the correct resource.

A big part of the second pass is something called "scan". Scan processes each and every PDF operator on a PDF page. It starts at the first object and proceeds to the end. At a minimum it must rename references to resources to use the new portable names. We call this "replace". In general we have machinery that can arbitrarily match some PDF operator and parameters and replace them with other operators or parameters (or remove them entirely).

On top of this machinery we added color processing. Color processing involves replacing one color with another (more on this in future posts). So as we process the PDF operators we may encounter a simple color operator, e.g., "0 0 0 rg". Again, this is basically (at least on the simplest level) a "scan" and "replace" operation; but in the case of color an elaborate color engine has been added to handle defining matches and replacements. The color machinery is able to look up the notion of "0 0 0" with "rg" and come up with a replacement. A simple replacement might be "0 0 0 1 k".
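As a toy illustration of scan-and-replace for color - not the pdfExpress engine, just a sketch over a decompressed content stream that is simple enough to split on whitespace (no strings or inline images):

    # Map a matched (operands, operator) pair to its replacement.
    # "0 0 0 rg" sets the fill color to RGB black; "0 0 0 1 k" sets CMYK 100% black.
    COLOR_RULES = {
        (("0", "0", "0"), "rg"): (("0", "0", "0", "1"), "k"),
    }

    FILL_COLOR_OPS = {"rg": 3, "g": 1, "k": 4}   # operator -> operand count

    def rewrite_colors(content_stream):
        """Scan a (decompressed, whitespace-tokenizable) content stream and
        replace matching fill-color operators."""
        tokens = content_stream.split()
        out = []
        for tok in tokens:
            if tok in FILL_COLOR_OPS and len(out) >= FILL_COLOR_OPS[tok]:
                n = FILL_COLOR_OPS[tok]
                operands = tuple(out[-n:])
                rule = COLOR_RULES.get((operands, tok))
                if rule:
                    new_operands, new_op = rule
                    out[-n:] = list(new_operands)
                    out.append(new_op)
                    continue
            out.append(tok)
        return " ".join(out)

    print(rewrite_colors("0 0 0 rg 10 10 100 100 re f"))
    # -> "0 0 0 1 k 10 10 100 100 re f"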

In the case of resources there may be images involved which also have associated color spaces. Again the same sort of processing occurs - but this occurs at the time of resource processing because the color handling for images is usually part of the image. Images typically involve RGB to CMYK translations for most of our customers.

Sunday, July 25, 2010

So our solution for this customer is based on a product we call pdfExpress XM. This product is like our basic pdfExpress product but it's designed to handle these large files. We created it because the customer called up a few years ago with a problem related to an inkjet printer they had. Seems the printer had performance problems RIPing files but the problems went away when the PDFs went through our application.

The only problem was that the files were larger, page-count wise, than we could typically handle and our application ran out of memory. Since this was a "rush" I had to come up with a scheme to make very large files work.

The idea I came up with is a PDF compiler that prepares the PDF files for a second pass in such a way as to ensure that very little memory is needed. Our original software was designed with the basic strategy of consuming the entire PDF file before it starts to process anything. While this was a reasonable strategy in 2001 it's no longer valid.

Basically when we inspect the entire PDF file we have to compute the set of resources used by each page so that if the page is used in an output file we know what resources to include. We also have to process the contents of the file to see what resources are used by the page. So, for example, virtually every PDF file has a Font resource called "F1". So if I wanted to place two PDF pages from two different files (or even from the same file, because there is no guarantee that the font "F1" on page one is the same font called "F1" on the second page) on the same page as part of an imposition there are only two ways to do it. The first is to use a PDF form object - which is kind of like a PDF subroutine. It "hides" the resource issue by creating a new scope for the F1 on each page.

While doing the form thing in PDF is fairly easy, it doesn't require you to have full control over the page content in the sense of knowing every resource used, i.e., you can embed the page in a form without having to deal with the content. Since the path I came from was the PDF VDP technology I had created (and patented, US 6,547,831), which involved "editing" PDF content to insert variable data, I declined to follow the form route.

So I inspect all the named resources like "F1" and give them a guaranteed unique name relative to what I am doing with them. Long story short, all this uses a lot of memory - especially editing PDF content streams - so I created a means to "compile" most of this processing in one pass and have a second pass that uses the predigested resource data to do the real work.
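The renaming step itself is easy to sketch. Assuming the page resources and a decompressed content stream are already in hand (hypothetical names throughout - the real product obviously does far more than this):

    import re

    def relocate_resources(resources, content_stream, source_tag):
        """Give page resources globally unique names and rewrite the content
        stream to match, so pages from different files can share one page.

        resources      - e.g. {"Font": {"F1": fontA}, "XObject": {"Im1": img}}
        content_stream - decompressed page content as a string
        source_tag     - unique per source page, e.g. "S3P12"
        """
        new_resources = {}
        for category, entries in resources.items():
            renamed = {}
            for old_name, obj in entries.items():
                new_name = f"{old_name}_{source_tag}"
                renamed[new_name] = obj
                # Rewrite /F1 -> /F1_S3P12 wherever it is used as a name token.
                content_stream = re.sub(rf"/{re.escape(old_name)}\b",
                                        f"/{new_name}", content_stream)
            new_resources[category] = renamed
        return new_resources, content_stream

    res, cs = relocate_resources({"Font": {"F1": "fontA"}},
                                 "BT /F1 12 Tf (Hello) Tj ET", "S3P12")
    print(cs)   # BT /F1_S3P12 12 Tf (Hello) Tj ET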

The "compiled" PDF is still a PDF. We just add metadata to the Cos object tree for our second pass. If you're not our second pass no problem - it just looks like a PDF. If we find our metadata we use it instead - saving tons of memory and allowing us to handle really large files.

Friday, July 23, 2010

Really high page count PDF files

Today it's surprising that almost no one, including most professionals, understands the issues with PDF file size. Most people believe that "the larger the file size in bytes the slower the file will RIP".

Nothing could be further from the truth. RIP time is almost always proportional to the complexity of the pages as well as the total number of pages. These issues interact and are non-linear. Basically there are things that can happen on a page which make the page RIP very slowly and there are things you can do on a page that consume limited RIP resources and, as the page count increases beyond a few thousand, choke the RIP's performance.

In designing my PDF library it was very important for me to avoid these usual pitfalls. Since I solve RIP performance issues professionally I did not want to create more problems for myself by creating a library that compounded the problems.

(This is one reason I am an outcast to some degree. I did try to purchase a PDF library from a third party but it never worked right - I quickly abandoned it and wrote my own.)

PDF is first and foremost just a structured file format. While interpreting the PDF file causes an image to appear, there is no reason that the file content cannot be treated as data. In order to address performance issues I try to look at the file format as purely data - it's irrelevant what the "data" does when, for example, merging PDF files or pages. What is important is that you respect what the data does.

To this day people do not understand this concept.

Initially my products started out as Acrobat Plug-ins. After struggling with performance issues customers kept bringing up I tried the aforementioned library. When that failed I created my own.

The critical item with PDF is to understand that it is like compiler output prior to linking. The compiler does not know where in memory your program will be located so it writes out a generic description of the program which the linker can easily relocate. pdfExpress does the same thing - while I could have used Adobe PDF forms to accomplish much of what I needed at the time, that would have limited my options later. Instead, what I do when combining PDF files is 'relocate' the PDF structures. So pdfExpress (and the whole family of products) is kind of like a linker - each PDF file is made independent of the others so that it can be combined with as many other PDFs as needed.

While this was a lot of work initially the product family has provided superior performance and customer results for almost a decade. No one else, as far as I know, has ever figured this out.

So why is all this important to the "customer discussion" I started in the last post?

Because my notion of "relocation" with PDF has given me two things I need for 2010 - one is speed - I can process pages about 1,000 times faster than any typical PDF library and I can process PDF files with 1,000 times more pages. So for large page count PDFs where typical products fail - I win.

The second notion is that I have to look at each PDF construct on every page (the PDF operators) to do the "relocation". At the time this was way more work than anyone else did to process PDF. But today, when I need to change the color in a PDF file - why, I am already touching everything in the file that has to do with color and I am "converting" it for relocation. So it was no big deal to change "convert for relocation" to "change for color".

Thursday, July 22, 2010

A customer identifies a problem...

I have a customer that receives tens of thousands of transpromo PDF pages a day for printing (mostly statements, bills, notices and the like filled with ads and other colorful elements). These pages come in groups from various sources along with metadata for mail sort processing.

The problem is that the color space is different in each file.

By this I mean that for one set of files someone was shown a set of proofs on an unknown paper in unknown lighting conditions using values for CMYK or RGB that "looked good". This same process with different paper, lighting, etc. was done for all the sets of PDF pages.

The problem I was asked to solve is "How do we print these jobs out every day and make sure the color is acceptable to the customer?"

There are some considerations that make this problem easier than it might be. First off the color issues are relatively predictable, i.e., a logo has bad color or the green bar across the page is wrong, that sort of thing - we are not talking about pages of full color images. There are images on the pages - but mostly as parts of ads.

We began work on this project about 3 or 4 years ago. At the time the requirements for "fixing" the color were simple - one logo or a color here or there.

Of course the usual "use ICC profiles" was mentioned but that was a miserable failure for several reasons: First, the work force is untrained in color and has a primarily black and white mailing background. Second, the source material was out of their control. Third, there was no concept of monitoring the color during production.

Another requirement was that the files could have up to 100,000 pages so PDF utilities for "fixing color" did not work.

Wednesday, July 21, 2010

What is the "PDF Outsider"

Since blog sites don't have nice neat ways to create categories I have created this blog to track my thoughts about PDF.

PDF is an Adobe, Inc. file format created in the 1990s to replace physical paper.

Overall they've done quite well...

For the last 12 years I have created products that process PDF files in ways that are completely different than those found in most products. These products have been very successful over the last decade.

People are interested in what I have done so I have created this blog for talking about it. It is linked from The "Lone Wolf" Graphic Arts Technologist as a PDF subcategory.

So why am I the PDF Outsider?

A couple of reasons:

First - I wrote my own PDF library (software for manipulating PDF files) based on a premise counter to every other PDF library in existence. It treats PDF as data instead of something to rasterize (more on this later...)

Second - I don't believe in the "standard PDF dogma".

Third - I have proven my beliefs valid by my own success.

This blog will be technical, complex and hard to follow. Most who read it will probably not understand.

I created my company Lexigraph, Inc. to sell my creations...

As a side note it is assumed that you are familiar with PDF, color for commercial print, workflows, and programming at a minimum. If not it will be rough going...

This blog is Copyright (C) Todd R. Kueny, Sr.