Welcome to Muck and Brass, the Snowtide blog site    

News from August, 2006

blog entry  2006/08/16

For nearly a year, we have been working on three things in parallel: a major product release, a new website, and a website/AJAX app.

All three of these things are absurdly complex, and large, and represent a huge amount of work. And, like the geniuses we are, we decided, “Hey, let’s release all of them at once!”

Well, what doesn’t kill you makes you stronger, right? It turns out that this was probably a very bad idea…not because we sacrificed quality or cut corners to make deadlines or anything strictly taboo like that. It was a bad idea because sleep is a precious thing.

I don’t have children (and, technically speaking, I’ll never have children, seeing as I’m of the male persuasion), so pardon me while I draw a very tenuous analogy between software development and pregnancy. We just had the software equivalent of triplets: one major product release, one website, and one website/AJAX app, all at once.

This is a good reminder that, 99% of the time, software (and really, business, for that matter), should be incremental. We know this, and have practiced it for a long time — but I can guarantee you, we understand it a lot better now that we’ve broken that rule.

That said, why don’t you say `hi` to our newborns: PDFTextStream v2.0, the new snowtide.com, and PDFTextOnline.

Posted at 16 Aug @ 3:30 PM by Chas Emerick | 0 comments
blog entry  2006/08/21

PDFTextStream started out as a Java library, but is now available and supported for Python. How that leap was made exemplifies how commercial and open source software efforts complement each other in the best of circumstances, and is also a fantastic case study in Java + Python integration.

In general, Java and Python don’t really mix. Their architectures, best practices, object models, and philosophies are pretty divergent in a lot of ways. Because of this, you don’t often find them cohabiting peacefully.

However, there are significant advantages to be had by bringing these two environments together. Python is a really elegant language, and is very well-suited to whole classes of software development that are much more painful to tackle in Java. Java has its advantages as well: a very mature standard library, a huge array of third-party library support, fantastic development environments, and the backing of big players in IT. As always, there’s a right tool for each job, and sometimes Java works best, and sometimes Python works best, but a combination would truly be more than the sum of its parts.

As PDFTextStream got its legs in the market about 18 months ago, our consulting business picked up, and I began to look for a way to use Python for prototyping and custom development in conjunction with PDFTextStream. Of course, back then, PDFTextStream was only for Java, so some bridge-building was in order.

I came across JPype (http://jpype.sourceforge.net), and found it to be a promising solution. JPype is an open-source Python library that gives “python programs full access to java class libraries”. Sounds good, and it was.

Eventually, however, we ran into some problems. Specifically, one of our clients wanted to have PDFTextStream extract text from PDF documents in-memory (i.e. without having the PDF file(s) on disk). That’s not a problem with PDFTextStream; we added that feature in short order.

However, this client was also adamant in their desire for a Python-based solution. The rest of their application (with which our piece integrated) is 100% Python, and their performance requirements (think millions of PDF documents processed per month) made running PDFTextStream as some kind of service component unthinkable.

What’s the problem? JPype, circa summer of 2005, copied data between Python and Java. That means that, if you have a PDF file in memory in Python, and want to use PDFTextStream’s in-memory extraction capability, JPype made a copy of that PDF file data before passing it off into the target Java function or constructor.

Bad, bad, bad. That was a huge performance hit to the application, and simply unacceptable from the client’s (and users’) point of view.

The obvious course of action was to make JPype, in effect, “pass by reference” when working with significant chunks of data (byte arrays, Strings, etc). This was no simple task, but we soon contacted the maintainer of JPype, a friendly fellow named Steve Ménard, and explained our predicament.

Within a few days, he had hammered out the idea to expose Python strings (the byte array of the Python world in most environments) as DirectByteBuffer objects in Java. This was a great idea, and meshed nicely with PDFTextStream’s in-memory processing API. Steve and I hammered out a relatively informal work agreement and hourly rate, and it was assumed by both of us that his enhancements to JPype for our purposes would stay licensed under the Apache v2.0 license to be enjoyed by the rest of the JPype community.
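
To make the difference between copying and sharing concrete, here is a minimal Java sketch. It is not JPype's actual code: the class and method names are made up for illustration, and a small direct buffer stands in for the memory that the Python side would supply. It only shows that a view over a direct `ByteBuffer` sees the same bytes as the original, where a copy does not.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

// Hypothetical names throughout; this illustrates the idea, not JPype internals.
class BufferSharingSketch {
    // Copying approach: every byte is duplicated before the data is handed over.
    static byte[] passByCopy(byte[] pdfData) {
        return Arrays.copyOf(pdfData, pdfData.length);
    }

    // Sharing approach: a read-only view over the same memory; no bytes are copied.
    static ByteBuffer passByReference(ByteBuffer pdfData) {
        return pdfData.asReadOnlyBuffer();
    }

    public static void main(String[] args) {
        // Off-heap memory, standing in for data owned by native (Python) code.
        ByteBuffer direct = ByteBuffer.allocateDirect(4);
        direct.put(0, (byte) '%');

        ByteBuffer view = passByReference(direct);
        direct.put(1, (byte) 'P'); // written through the original buffer

        // The view reads the byte written after it was created: same memory.
        System.out.println((char) view.get(1));
        System.out.println(view.isDirect());
    }
}
```

For a multi-megabyte PDF, the copying path costs an extra allocation and a full pass over the data on every call, which is the overhead described above.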

Nailing down all the technical details took a few weeks, but in the end, Steve was successful. We were able to put PDFTextStream’s entire API to use from within Python in a way that sacrificed not one ounce of performance or functionality.

So what’s the upshot of all of this?

  • Our consulting job completed with high praise from our customer, and our component of their application continues to hum away, extracting text from millions of PDF documents per month using PDFTextStream from Python.
  • We’ve since worked with Steve here and there as necessary in order to make additional tweaks to JPype. Because of his help, we now distribute a supported version of PDFTextStream for Python, whose Python/Java integration is made possible by JPype.
  • The JPype project retains the new/improved functionality that we paid for, and the broader community continues to benefit from that.
  • Steve got to pick up a new Mac mini, plus whatever else he felt like buying with his hard-earned cash.

That’s what I call a win-win situation, for us, for our customers, for Steve, and for the JPype project and its other users. In an ideal world, this is how open source and commercial software efforts should collaborate and cross-pollinate.

Posted at 21 Aug @ 4:33 PM by Chas Emerick | 0 comments
blog entry  2006/08/23

PDFTextOnline, our shiny-new AJAX-y PDF text extraction application, is a nifty tool, and we’re getting some decent feedback. However, many people have indicated (not so indirectly) that its user interface sucks. Yeah, OK, our bad.

This is a lesson that was learned about a decade ago, which we didn’t so much recognize as stumble over. Here’s PDFTextOnline’s user interface currently (screenshot):

It doesn’t look too bad, right? Not so shabby for ‘beta’, whatever that means. Of course, using it is a wholly separate matter.

  • The buttons in the toolbar in the upper right corner are entirely opaque as to their meaning. Even though the icons use familiar visual metaphors (open folder for the ‘Open File’ action, a disk for the ‘Save As’ action, etc), they don’t seem to work within this environment.
  • The split between the drawer on the left and the main text area doesn’t really work quite right, apparently regardless of whether you’re a Windows, Mac, or Linux person.
  • Users are flummoxed, and don’t see what the path is from point A to point B.

Those are just a few of the comments we’ve received so far. The point being, of course, that we didn’t design the UI for the web, as we should have; we designed it to mimic a desktop PDF viewer (except PDFTextOnline’s stock in trade is text). Maybe if we redoubled our efforts, we could roll in a new widget set (perhaps those from Backbase, or something similar), tighten all of the JavaScript that works with the interactive bits to make those parts more snappy, and end up with something that felt more desktop-ish.

Of course, that’s a bad idea. This is not a desktop application, it’s a web application. Duh.

We’ll have something better in a few weeks, promise.

Posted at 23 Aug @ 4:40 PM by Chas Emerick | 0 comments

In my rush to self-flagellate in my last post, I neglected to mention that PDFTextOnline’s ‘Save Text to Disk’ command is now available.  This is really what makes PDFTextOnline worthwhile — being able to get a quality text extract from your PDF documents without spending time copy-and-pasting everywhere.  (Not to mention all of the other advantages that PDFTextOnline gets you, especially Chinese, Japanese, and Korean text extraction capability, which is generally shoddy in ‘regular’ PDF viewers.)

Give a high-quality text extraction tool a whirl.

Posted at 23 Aug @ 4:41 PM by Chas Emerick | 0 comments
blog entry  2006/08/30

Today, we released PDFTextStream v2.0.1, a minor patch release that contains a workaround for an interesting and unfortunate bug: on Windows, if one accesses a PDF file on disk using PDFTextStream, then closes the PDFTextStream instance (using `PDFTextStream.close()`), the PDF file will still be locked. It can’t be moved or deleted.

This is actually not a bug in PDFTextStream, but in Java, documented as Sun bug #4724038. In short, any file that is memory-mapped cannot reliably be “closed”: the `DirectByteBuffer` (or some native proxy, perhaps) that holds the OS-level file handle does not release those resources, even when the `FileChannel` that was used to create the `DirectByteBuffer` is closed. The comments on that bug report show a great deal of frustration, and rightly so: regardless of the technical reasons for the behavior, memory-mapping files isn’t rocket science (or, hasn’t been for 20 years or somesuch), and this kind of thing shouldn’t happen.
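
The underlying behavior can be seen in plain Java, with no PDFTextStream involved. The sketch below (hypothetical class and method names, and a temporary file standing in for a PDF) maps a file, closes the `FileChannel`, and then reads from the mapping anyway.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical names; a demonstration of the JDK behavior, not PDFTextStream code.
class MappedFileSketch {
    // Maps the whole file read-only; try-with-resources closes the channel on return.
    static MappedByteBuffer mapThenClose(Path file) throws IOException {
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            return channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
        }
    }

    // Writes content to a temp file, maps it, and reads the last byte
    // after the channel has already been closed.
    static byte lastByteAfterClose(byte[] content) {
        try {
            Path file = Files.createTempFile("mmap-sketch", ".bin");
            file.toFile().deleteOnExit();
            Files.write(file, content);
            MappedByteBuffer mapped = mapThenClose(file);
            return mapped.get(content.length - 1);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // The read works even though the channel is closed: the mapping is still alive.
        System.out.println(lastByteAfterClose(new byte[] {1, 2, 3}));
    }
}
```

The `FileChannel.map` documentation says as much: a mapping does not depend on the channel that created it and stays valid until the buffer is garbage-collected. On Windows, that lingering mapping is what keeps the file locked.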

Since we can’t fix the bug, we devised a workaround: if you set the `pdfts.mmap.disable` system property to `Y`, then PDFTextStream won’t memory-map PDF files. Simple enough fix. FYI, there appears to be no performance degradation associated with using PDFTextStream in this mode.
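
A rough sketch of setting that property from code follows. The property name and value are the ones given above; the class and method names are made up, and it is an assumption here that the property needs to be set before PDFTextStream opens its first file. The same thing can be done without code by passing `-Dpdfts.mmap.disable=Y` on the `java` command line.

```java
// Hypothetical helper class; only the property name and value come from the post.
class DisableMmapSketch {
    static final String PROPERTY = "pdfts.mmap.disable";

    // Assumption: call this early, before PDFTextStream opens any file.
    static void disableMemoryMapping() {
        System.setProperty(PROPERTY, "Y");
    }

    // Checks what this JVM currently has set. This is not a claim about
    // how PDFTextStream itself reads the property.
    static boolean isDisableRequested() {
        return "Y".equals(System.getProperty(PROPERTY));
    }

    public static void main(String[] args) {
        disableMemoryMapping();
        System.out.println(PROPERTY + "=" + System.getProperty(PROPERTY));
    }
}
```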

Of course, this is only a problem on Windows, which does not allow files to be moved or deleted while a process has an open file handle. We have a number of customers that deploy on Windows Server (although that number is much smaller than those that deploy on a variety of *nix), but until last week, they hadn’t reported any problems. Our best guess is that, given the systems we know those customers are running, they are probably using PDFTextStream’s in-memory mode (where PDF data is in memory, and provided to PDFTextStream as a `ByteBuffer`). Of course, in that case, no file handles are ever opened, so all is well.
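
For comparison, here is one way to get a file's contents into a `ByteBuffer` without memory-mapping it. This is a sketch with hypothetical names, and it stops short of calling PDFTextStream, since the exact constructor isn't shown in this post. `Files.readAllBytes` opens the file, reads it, and closes it again, so nothing keeps the file open afterwards.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical names; shows plain JDK calls only.
class InMemoryLoadSketch {
    // Reads the whole file onto the heap; the file is closed before this returns.
    static ByteBuffer loadIntoMemory(Path file) {
        try {
            return ByteBuffer.wrap(Files.readAllBytes(file));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Writes content to a temp file, loads it, then deletes the file right away.
    static ByteBuffer loadThenDelete(byte[] content) {
        try {
            Path file = Files.createTempFile("in-memory-sketch", ".pdf");
            Files.write(file, content);
            ByteBuffer data = loadIntoMemory(file);
            Files.delete(file); // this code holds no handle or mapping on the file
            return data;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        ByteBuffer data = loadThenDelete(new byte[] {'%', 'P', 'D', 'F'});
        // The bytes are still available after the file itself is gone.
        System.out.println(data.remaining());
    }
}
```

The trade-off is that the whole file has to fit in memory at once.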

This problem is the topic of a new FAQ entry as well.

Posted at 30 Aug @ 4:45 PM by Chas Emerick | 2 comments

I might write more about specifics, but I wanted to get this link out there. I only now came across Guy Kawasaki’s entrepreneurship video series hosted by the Stanford Technology Ventures Program. It’s from 2004, so I’m probably the last software company owner to “discover” it, but I’m glad I did nonetheless.

I’ve long discounted Guy as “that Mac evangelist from way back” — to my detriment.  He’s really putting some great content and great ideas out there, regardless of what some may think of how he comes off personally.  There are so many aspects of these clips that resonate with my personal experience of launching Snowtide, seeing it fail, and then relaunching it again two years ago (and thankfully seeing it soar this time!).  That kind of connection-at-a-distance is rare, and really valuable, so I’ll certainly be keeping up with (and catching up with) Guy’s doings from now on.

Posted at 30 Aug @ 4:50 PM by Chas Emerick | 0 comments
Founder, Snowtide Informatics

About Me

I'm the founder of Snowtide Informatics. We make DocuHarvest, a web application that turns your valuable documents into data, and PDFTextStream, a PDF text extraction library for Java and .NET. I do a lot of programming in Clojure and just a little in Java, trying to make it easier for people to make unstructured content just a little more useful.
