Welcome to Muck and Brass, the Snowtide blog site    

News from February, 2009

blog entry  2009/02/03

I hadn’t tried TweetDeck yet, and thought I’d give it a run.  It requires Adobe AIR, and I thought I’d end up having to do the download/install dance.  But lo and behold, the TweetDeck “Install Now” Flash button bootstraps a local install of AIR for me!  No mess, no fuss, and the whole thing took less than two minutes and three clicks.

That’s cool and all, but we’re (mostly) JVM partisans here, so the inevitable question is: why doesn’t Sun use the same mechanism to drive deployment of the latest and greatest JRE/JavaFX?  Flash functionally has 100% penetration, so it makes tons of sense to use it to make it trivial to get the JRE out there, at which point Java’s “native” update functionality can take over.

I’m not a RIA guru by any stretch, so maybe there’s a good reason why this isn’t done?

Posted at 03 Feb @ 6:23 PM by Chas Emerick | 0 comments

blog entry  2009/02/26

I’m generally not one to be impressed by celebrity; you won’t catch me reading People or US Weekly, for example.  However, this morning I noticed with a shimmer of glee that Peter Norvig used some code that I wrote years ago in one of his recent projects.  So, just for the record, if Dr. Norvig ever shows up in US Weekly, I’ll pick one up!

In case you don’t know, Peter Norvig is the Director of Research at Google.  That’s interesting, but the real reason Dr. Norvig holds sway with me is his classic book, Paradigms of Artificial Intelligence Programming.  If it weren’t for that book, I almost certainly would not be doing what I’m doing today.  Its pages are where I came to understand lisps, and began to imagine what was possible and what I might be able to accomplish in computer science (final results yet to be determined, of course).  For that, I am extraordinarily grateful to him (and others, of course, but I’ll wait to talk about them when they get around to using some of my code! ;-) ).

Back to the story.  This morning, I decided to hop onto Google Analytics for a bit to check up on the traffic stats for our various websites.  Lo and behold, in the “top referrals” listing, I saw ‘norvig.com’; “Well,” I thought to myself, “that’s interesting!”   A quick grep of the server logs (is there a screen in Google Analytics that actually provides you with the full referral URLs?) showed the referral URL to be Dr. Norvig’s “post” from last week, An Exercise in Species Barcoding.

A search of my name on that page shows that he needed a way to calculate the Levenshtein distance (also known as the edit distance) between two large strings — his quick implementation (like most) operated in O(n^2) space, which would have required weeks of processing time in his particular case.  So, he looked around for a more efficient implementation, and found one that I wrote in October of 2003 that operated in linear space bounds (and was, ironically enough, my first-ever contribution to an open source project).  With a couple of tweaks to suit his specific needs, the code I wrote worked out nicely for him.

This story is satisfying and funny (for me, anyway) in a couple of different ways:

First, there’s the fact that (what I would now consider) throwaway work of mine is still floating around the nets six years later.  Remember kids, the Internet never forgets!

Second, it reminded me of what I was doing when I wrote that particular code.  I was building what would later become PDFTextStream’s first ground-truthing system¹ (although I don’t think I knew of that term at the time).  It’s a lot more sophisticated now, but back in 2003, I was simply trying to set things up so that the full (vetted and known-good) extracted text from each PDF document in our nascent test repository would be saved off somewhere, and later builds of PDFTextStream would compare their extracted PDF text to those saved files.

Of course, it wouldn’t be practical to require that PDFTextStream produce identical output forever; some amount of slop had to be allowable, because (for example) if an extracted word was output with four spaces before it instead of two, that would generally still be acceptable.  For that and other reasons, I wanted to test that current PDF text extracts were the same as the known-good extracts within a defined margin of error.  Unfortunately, I was ground-truthing full document extracts at that time, and most Levenshtein functions, with their quadratic space requirements, would take far too much memory to diff the multi-megabyte strings that were involved.

Solution: write my own Levenshtein function (loosely based on a pedagogical implementation by Mike Gilleland that had been incorporated into the Apache commons-lang project) that operated in linear space bounds.  Thankfully, I opted to offer the improvement back to the Apache commons-lang project and to Dr. Gilleland; had I not, Dr. Norvig would never have found that code, and I wouldn’t be writing this right now.
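
For the curious, here is a minimal sketch of the general idea, not the code that went into commons-lang.  The usual trick is that each row of the edit-distance table depends only on the row before it, so two arrays of length n+1 can stand in for the full n-by-m matrix.  The class name, the method names, and the ratio-based tolerance check are all hypothetical, made up for this illustration; in particular, the margin of error used in the real ground-truthing system is not described here, so treating it as "edits divided by the longer length" is only an assumption.

```java
import java.util.Objects;

/**
 * Illustrative two-row Levenshtein distance.
 * Names are hypothetical; this is not the commons-lang API.
 */
final class LinearSpaceLevenshtein {

    private LinearSpaceLevenshtein() {}

    /** Edit distance between s and t using O(min(|s|, |t|)) extra space. */
    static int distance(CharSequence s, CharSequence t) {
        Objects.requireNonNull(s, "s");
        Objects.requireNonNull(t, "t");

        // Put the shorter string along the row so the arrays stay small.
        if (s.length() > t.length()) {
            CharSequence tmp = s;
            s = t;
            t = tmp;
        }
        int n = s.length();
        int m = t.length();
        if (n == 0) {
            return m; // every character of t must be inserted
        }

        int[] prev = new int[n + 1]; // costs for the previous character of t
        int[] curr = new int[n + 1]; // costs for the current character of t
        for (int i = 0; i <= n; i++) {
            prev[i] = i; // turning s[0..i) into the empty string takes i deletions
        }

        for (int j = 1; j <= m; j++) {
            char tj = t.charAt(j - 1);
            curr[0] = j; // turning the empty string into t[0..j) takes j insertions
            for (int i = 1; i <= n; i++) {
                int cost = (s.charAt(i - 1) == tj) ? 0 : 1;
                int viaEdit = Math.min(curr[i - 1] + 1, prev[i] + 1);
                curr[i] = Math.min(viaEdit, prev[i - 1] + cost);
            }
            // Reuse the two arrays instead of allocating a new row each time.
            int[] swap = prev;
            prev = curr;
            curr = swap;
        }
        return prev[n];
    }

    /**
     * Assumed tolerance rule: accept when the number of edits is at most
     * maxErrorRatio times the length of the longer text.
     */
    static boolean withinTolerance(CharSequence expected, CharSequence actual,
                                   double maxErrorRatio) {
        int longest = Math.max(expected.length(), actual.length());
        if (longest == 0) {
            return true; // two empty extracts are trivially identical
        }
        return distance(expected, actual) <= maxErrorRatio * longest;
    }
}
```

With something like this, a call such as LinearSpaceLevenshtein.distance("kitten", "sitting") returns 3, and a regression check could pass the known-good extract and the current extract to withinTolerance with whatever ratio the test suite settles on.  The running time is still proportional to n times m; only the memory footprint shrinks.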

Third and finally, this story is satisfying because, hell, Peter Norvig used some of my code.  A person I respect and admire has found it convenient to use some minor thing I created years ago, and was thoughtful enough to say so.  I hope I can follow that example as I go along in my travels.

See, Dr. Norvig, I’m still learning from you.

Footnotes:

1 Ground truthing is a testing methodology often used in document processing systems where ideal or otherwise known-good output is cataloged, and then actual or current output is compared to it to determine relative accuracy.  PDFTextStream’s current ground-truthing system serves as a semi-rigorous smoke test of its aggregate text extraction accuracy while we’re doing active development, as well as an ironclad regression test for when we’re looking to cut a release.  Thankfully, it’s come a long, long way from the very naive approach I was pursuing in 2003.

Posted at 26 Feb @ 6:23 PM by Chas Emerick | 2 comments
Founder, Snowtide Informatics

About Me

I'm the founder of Snowtide Informatics. We make DocuHarvest, a web application that turns your valuable documents into data, and PDFTextStream, a PDF text extraction library for Java and .NET. I do a lot of programming in Clojure and just a little in Java, trying to make it easier for people to make unstructured content just a little more useful.
