Saturday, February 25, 2006

Getting the bugs out (of PDF)

On a contract I am working on, some joker thought it would be funny to give me a list of defects for the project as a PDF file. Actually, they didn't give it to me. It was zipped up in a deployment release bundle, inside a zipped-up folder called "Limitations."

I'm actually lucky to have found it at all. So I suppose I should be happy to have it.

I bet their PDF export script (using XSL-FO, no doubt) is the pride and joy of the consulting firm who used it to output their bugs (note: there is a bug in the output for 'Assign To'). And while PDFs have really nice anti-aliased fonts when printed out, I didn't have access to a printer.

Having to go through a large list of defects in PDF, even if they were printed out, wouldn't have been fun.

In all fairness, it was a nice layout, and would have been easy to read on paper, but I wanted something I could manipulate.

Thankfully, Acrobat Reader isn't what it used to be: you can search a document, and there is also a feature that was added I don't know when (recently?), "save as text."

I now had a plaintext document that I could search with a text editor, and even more importantly, read more than eleven and a half lines on a page.

But I still wanted more. So I thought I'd slurp them all up into a quick and dirty database to run some queries on and maybe even eventually output in a format that's somewhere in the happy medium between .pdf and .txt.

I thought I'd try using Python for this, since I wanted to brush up on it, and I figured it wouldn't take more than an hour.

I got going quickly enough: reading a file, check; connecting to a database, check.

But I found out quickly enough that this was not a nicely structured document. After a bit of regex massaging in my favorite Windows text editor, EditPlus, I still wasn't getting it.

A few futile searches later, I was reading Python newsgroups with people confessing to dipping their toes in the "dark side" for text processing.

This isn't a Python-bashing post, but I put it away after using up the allotted time, and went on to other things.

***

But at 6:15 this morning I was thinking of a solution, and I didn't know if it was real or just a product of that dream state of mind where everything *seems* to make sense.

So I did what any sensible person would do. I ignored it and tried to go back to sleep. It's *Saturday*.

A few minutes later I switched on the stereo, Scorpions "Best of the Ballads, Hot and Slow", a vintage cassette tape, and was soon "In Trance." Before the end of "Yellow Raven", I had my company laptop on, lying in bed, and I was working in PHP.

At eight o'clock the battery died and I got up and put it away.

How did I get from Python to Perl to PHP?

Well, it started out as a simple enough text parsing task, but I found that because of several special conditions (which I should have just ignored) you really needed an ugly procedural mess with a lot of conditions to parse it right. But you needed an object model to store it all, because it wasn't something that could be done with one pass through the file.
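To make the shape of that problem concrete, here is a rough Python sketch. Everything about the file layout is made up for illustration: it assumes each defect starts with a "Defect <number>" line followed by "Key: value" lines, that wrapped lines continue the previous field, and that a hypothetical "Related To" field points at another defect. The real export was messier than this. The first pass is the procedural mess of conditions; the second pass is the part that needs all the objects in memory.

```python
import re
from dataclasses import dataclass, field

# Hypothetical layout (an assumption, not the real export):
#   "Defect <number>" starts a record, "Key: value" lines follow,
#   and any other non-blank line is a wrapped continuation.
HEADER = re.compile(r"^Defect\s+(\d+)\s*$")
FIELD = re.compile(r"^([A-Za-z ]+):\s*(.*)$")


@dataclass
class Defect:
    number: int
    fields: dict = field(default_factory=dict)


def parse_defects(text):
    """First pass: group the lines of the text export into Defect objects."""
    defects = []
    current = None
    last_key = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        header = HEADER.match(line)
        if header:
            current = Defect(int(header.group(1)))
            defects.append(current)
            last_key = None
            continue
        if current is None:
            continue  # preamble before the first defect
        found = FIELD.match(line)
        if found:
            last_key = found.group(1).strip()
            current.fields[last_key] = found.group(2).strip()
        elif last_key is not None:
            # Wrapped line: glue it onto the field it continues.
            current.fields[last_key] += " " + line
    return defects


def referenced_by(defects):
    """Second pass: map each defect number to the defects that point at it.

    This cannot be done while reading, because a defect may be referenced
    by one that appears later in the file.
    """
    known = {d.number for d in defects}
    refs = {number: [] for number in known}
    for d in defects:
        target = d.fields.get("Related To", "")  # hypothetical field name
        if target.isdigit() and int(target) in known:
            refs[int(target)].append(d.number)
    return refs
```

Calling `parse_defects()` on the saved text and then `referenced_by()` on the result gives you objects you can query or load into a database, which is about as far as a quick and dirty job needs to go.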

PHP is good at switching from a glob of spaghetti to a set of functions, to using objects as glorified hashes, and now with PHP5, it's even got a decent, though limited, object model.

But what I ended up doing was working on my pet project, an ORM for PHP, only to realize when all was said and done, that what I needed in this instance was an ActiveRecord.
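For anyone who hasn't met the pattern, an ActiveRecord is an object that wraps one table row and knows how to save and find itself. A minimal sketch in Python with the standard `sqlite3` module might look like this. The `DefectRecord` class, the `defects` table, and its columns are all hypothetical names picked for the example, not what I actually wrote.

```python
import sqlite3


class DefectRecord:
    """One row of a hypothetical `defects` table that can save itself."""

    def __init__(self, conn, number, summary, assigned_to=None):
        self.conn = conn
        self.number = number
        self.summary = summary
        self.assigned_to = assigned_to

    @staticmethod
    def create_table(conn):
        conn.execute(
            "CREATE TABLE IF NOT EXISTS defects ("
            "number INTEGER PRIMARY KEY, summary TEXT, assigned_to TEXT)"
        )

    def save(self):
        # INSERT OR REPLACE keeps the sketch short: saving twice
        # simply overwrites the row with the same defect number.
        self.conn.execute(
            "INSERT OR REPLACE INTO defects (number, summary, assigned_to) "
            "VALUES (?, ?, ?)",
            (self.number, self.summary, self.assigned_to),
        )
        self.conn.commit()

    @classmethod
    def find(cls, conn, number):
        row = conn.execute(
            "SELECT number, summary, assigned_to FROM defects WHERE number = ?",
            (number,),
        ).fetchone()
        return cls(conn, *row) if row else None


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    DefectRecord.create_table(conn)
    DefectRecord(conn, 7, "Crash on export", "alice").save()
    print(DefectRecord.find(conn, 7).summary)
```

The row and the persistence logic live in the same class, which is all a one-table job like this one calls for.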

Maybe I should have just used Ruby?
