Welcome to Muck and Brass, the Snowtide blog site    

News from July, 2005

blog entry  2005/07/13

…like oil and water, right? Not necessarily; we should hope not, for otherwise we’re all in trouble.

Last week, someone anonymously posted a comment to a previous entry of mine. In a nutshell, he or she implied that the benchmarks we publish comparing PDFTextStream text extraction performance to that of other Java PDF libraries were rubbish. Here’s the comment in its entirety:

If the product is so good, why are your speed comparisons using your latest version against 2 year old products.

Wow, that hurt. I responded with a comment to the same entry, but the original implication was serious enough that I felt compelled to make a more visible statement about the benchmark that we publish.

The core complaint in the comment was that we’re tilting the playing field by comparing PDFTextStream to other years-old Java PDF libraries. That was and is fundamentally untrue, except in the case of Etymon’s PJ library. Here, I’ll quote my response on this issue from my comment in the original entry:

Etymon PJ was abandoned in favor of PJx years ago; PJx hasn’t been under active development since April of 2004 though (see http://sourceforge.net/projects/pjx/), and in its current state provides no API for text extraction that we can see. However, our original benchmarks nevertheless showed the older PJ library to be the fastest of the available libraries (second to PDFTextStream), so we included it even though Etymon doesn’t appear to support it anymore.

Our perspective on this is that we have been trying to be as transparent and honest as possible with these benchmarks from day one; therefore, when searching out Java PDF libraries to compare to PDFTextStream, we wanted to find the toughest competition possible. We found Etymon’s PJ library to be the fastest text extraction library (second to PDFTextStream), so we included it in the benchmark.

I think that’s very fair, and very honest. Frankly, given the sometimes rabid nature of skepticism in some developer circles, we would likely have been suspected of hiding something if we had originally decided to exclude the PJ library because it’s no longer supported.

Benchmarks have long been viewed with suspicion by technologists of all stripes, but being a publisher of a benchmark has provided me with some perspective. Yes, benchmarks can be gamed; yes, internally-conducted benchmarks can be more vendor fantasy than reality. We knew this from the start, which is why we made extraordinary efforts to make the benchmark as transparent as possible (by publishing the benchmark code, test files, and methodology along with the bottom-line results). Any skeptics are free to run the benchmarks themselves, and report any observed discrepancies.

If that’s not the gold standard of honesty when it comes to benchmarking, and if a benchmark conducted and published in this manner cannot be trusted by the broader developer community, then we’re all in trouble. There are thousands of software products out there, all of which claim a particular advantage over their competition. Some advantages are qualitative, and cannot be measured. That’s fine. However, other advantages are quantifiable; for these claims, we should all welcome a transparent, published benchmark. Otherwise, the process of selecting software products descends into a matter of who has the better marketing and PR game (not that that hasn’t happened to a very large extent already, but that’s a different post!).

Fundamentally, I hope the benchmark doesn’t matter. In the end, I would hope that every developer that is looking for a PDF extraction solution for Java would download all of the available libraries and do some real due diligence to determine which library delivers the best features and throughput in their environment. Voilà, everyone wins.

There’s little left to say, except that, if you find our benchmark to be unconvincing, we remain open and receptive to feedback. If there’s a way we can improve the benchmark, whether through changing methodology, test files, or tweaking timing code, we’ll do it.
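
For readers who want to sanity-check timing code of this kind themselves, here is a minimal sketch of a harness. It is not the published benchmark code: the class name `ExtractionTimer`, the `medianNanos` method, and the stand-in extractor are all assumptions made for illustration, and a real run would swap in a call to whichever library is under test. The warm-up runs and the median are there to damp JIT compilation and outlier noise.

```java
import java.util.Arrays;
import java.util.function.Supplier;

// Hypothetical harness for illustration; not Snowtide's benchmark code.
public class ExtractionTimer {

    /**
     * Runs the task `warmups` times untimed, then `runs` times timed,
     * and returns the (upper) median elapsed time in nanoseconds.
     */
    static long medianNanos(Supplier<String> task, int warmups, int runs) {
        if (runs < 1) {
            throw new IllegalArgumentException("runs must be >= 1");
        }
        for (int i = 0; i < warmups; i++) {
            task.get(); // untimed: gives the JIT a chance to compile hot paths
        }
        long[] samples = new long[runs];
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            String text = task.get();
            samples[i] = System.nanoTime() - start;
            if (text == null) {
                // Also keeps the result in use so the call is not optimized away.
                throw new IllegalStateException("extraction returned null");
            }
        }
        Arrays.sort(samples);
        return samples[runs / 2];
    }

    public static void main(String[] args) {
        // Stand-in workload; replace with a real extraction call when testing a library.
        Supplier<String> standIn = () -> {
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < 10_000; i++) {
                sb.append("text ");
            }
            return sb.toString();
        };
        System.out.println("median ns: " + medianNanos(standIn, 5, 21));
    }
}
```

Running each library through the same harness, on the same files and the same JVM, is what makes the resulting numbers comparable.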

Posted at 13 Jul @ 12:20 AM by user Chas Emerick | comment 0 comments
blog entry  2005/07/18

PDFlib released a PDF text extraction component, so let’s see how we stack up.

A week or so ago, Dan Shea at PlanetPDF posted a news item about PDFlib releasing a PDF text extraction library. That’s obviously very interesting to us, simply because until now, PDFTextStream has been the only library out there concentrating on PDF text extraction.

My first reaction to reading this news was to shoot an email off to Dan, suggesting that a PDF text extraction library shootout of some kind might be in order. His reply was, “What do you have in mind?”

Well, jeez, I hadn’t gotten that far yet. I assume any comparison of text extraction libraries should focus on a few things immediately critical to the endeavor:

  • Text extraction accuracy
  • Operational performance and throughput
  • PDF compatibility (PDF specification support, decryption services, etc.)
  • Auxiliary features (accessibility of other content)

And then there are the extras that one looks for in any library:

  • Platform/Environment support
  • API clarity
  • Documentation and support
  • Vendor stability and longevity

Obviously, there’s a lot there, and since text extraction is a minute field compared to PDF generation, etc., Dan (or any other reviewer) would likely pick and choose what to focus on. May he (and others) always choose those aspects where we dominate… ;-)

In this particular situation, there’s also the complication of platform support: PDFlib’s component is available on a variety of platforms (through C bindings), whereas PDFTextStream is only available on the Java platform. That gives PDFlib an obvious advantage where Java isn’t in play, since we’re not showing up on .NET, Python, etc., yet.

Anything missing here? Feel free to email me with any aspects that you think are important.

Posted at 18 Jul @ 10:38 AM by user Chas Emerick | comment 0 comments
blog entry  2005/07/22

Notes on general housekeeping around here.

Just an FYI — we’ve begun to slowly improve the technical side of this blog. The feeds are now no longer running a day behind new posts, and there are actual ‘next’ and ‘previous’ links at the bottom of the main blog view. It’s the little things, right?

For those that are interested, this blog is run on Quills, a blog product for Plone (the Zope CMS that handles most of the dirty work of keeping our site humming nicely).

Quills is a decent blog platform, probably the best that is out there for Plone. I originally learned of it from Tom Lazar’s blog, which also uses Quills.

Some might say that we should have a ‘Quills Powered’ badge on here somewhere, but truth be told, we’ve rewritten about 1/2 of the product (most of the page templates). It provides a nice object model, but the presentation simply does not work for us out of the box at all, and much of it really doesn’t make much sense to me (the structure of the archive pages, in particular). Shipping our improvements over to the Quills folks is probably a good idea (best to not gripe without pitching in and all that), but a *ton* of cleanup would be required to make it usable outside of our site again. There’s just not enough time in the day.

Posted at 22 Jul @ 10:40 AM by user Chas Emerick | comment 0 comments
blog entry  2005/07/25

I’ve been mulling over the relationship and differences between PDFTextStream’s API and other PDF-related API’s.

I was originally going to write a pretty long tract on this topic, but relented mid-way because I realized that I likely don’t have the concepts straight in my own head, never mind being able to put them down on screen.

PDFBox, JPedal, and other fine PDF libraries present very comprehensive API’s to a developer-user, ones which mirror the nature of PDF data structures to the hilt. That’s excellent, especially if you need to do some low-level mucking around.

To get to more sophisticated functionality (like the extraction of text, generation of PDF’s etc.), additional API’s need to be laid on top of the lower strata of data structures. It’s a very clean, formal computer science approach that fosters maintainability, reuse of code, and representational consistency.

PDFTextStream takes a somewhat different approach. It is primarily interested in fulfilling a very particular set of developer-user requirements: specifically, the extraction of text and other PDF content with maximal accuracy and throughput. To get there, we simply could not use the layered low-level API approach — while we might be able to make extraction functionality work, the overhead involved in that approach increases dramatically as the complexity of the functional requirement rises.

The result is PDFTextStream’s API, which, if shown to an expert in the PDF document format, would look completely foreign. There are no references to PDF objects, dictionaries, names, XObjects, PostScript, or virtually any other PDF-specific data structures. This is because the PDFTextStream API is focussed on providing the shortest route from point A to point B for the developer-user looking to extract content from PDF documents. Period. Obviously, this has the drawback of making the PDFTextStream API singularly useless for anyone who wants to generate PDF reports (for example).

The best terms I can come up with for these types of API’s are ‘transparent’ (for tiered, low-level API’s), and ‘functional’ (for API’s dedicated to a specific functional domain to achieve side-benefits of specialization). Both have their place; transparent API’s are likely to always be more popular (since they have broad applicability), whereas functional API’s are likely to always maintain an edge within their particular domain.
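
To make the distinction concrete, here is a rough sketch of what each style looks like from the caller’s side. Everything in it is hypothetical: `Dict`, `TextExtractor`, and the other names are invented for this illustration and are not the API of PDFTextStream, PDFBox, JPedal, or PJ. The toy document also stores page text as plain strings, which real PDF content streams do not do. The sketch shows only what the caller has to know in each case; it says nothing about how any real library is built internally.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.HashMap;
import java.util.Map;

// Toy illustration only; every type here is hypothetical.
public class ApiStyles {

    /** 'Transparent' style: a generic dictionary that mirrors the file format. */
    static final class Dict {
        private final Map<String, Object> entries = new HashMap<>();

        Dict put(String name, Object value) {
            entries.put(name, value);
            return this;
        }

        Object get(String name) {
            return entries.get(name);
        }
    }

    /** 'Functional' style: one task-oriented call, no format vocabulary. */
    interface TextExtractor {
        String extractText();
    }

    /** Transparent usage: the caller must know the Pages -> Kids -> Contents layout. */
    static String textViaTransparentApi(Dict catalog) {
        Dict pages = (Dict) catalog.get("Pages");
        @SuppressWarnings("unchecked")
        List<Dict> kids = (List<Dict>) pages.get("Kids");
        StringBuilder out = new StringBuilder();
        for (Dict page : kids) {
            // A plain String stands in for a real content stream to keep this short.
            out.append((String) page.get("Contents")).append('\n');
        }
        return out.toString();
    }

    /**
     * Functional usage: the caller asks for text and nothing else.
     * This toy reuses the walk above purely for brevity.
     */
    static TextExtractor open(Dict catalog) {
        return () -> textViaTransparentApi(catalog);
    }

    /** Builds a two-page toy document in the transparent representation. */
    static Dict sampleDocument() {
        List<Dict> kids = new ArrayList<>();
        kids.add(new Dict().put("Contents", "Hello"));
        kids.add(new Dict().put("Contents", "World"));
        return new Dict().put("Pages", new Dict().put("Kids", kids));
    }

    public static void main(String[] args) {
        Dict doc = sampleDocument();
        System.out.print(textViaTransparentApi(doc)); // caller navigates the structure
        System.out.print(open(doc).extractText());    // caller names the task only
    }
}
```

The transparent version lets the caller reach anything in the structure, at the cost of needing to know the format; the functional version can only extract text, but the caller needs to know nothing else.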

So what’s the point? I find the distinction between ‘transparent’ and ‘functional’ API’s fascinating because the comparison is decidedly nontechnical: it’s about how people interact with the software, and how a software vendor wants to present itself and its product to its users. These might be the kinds of tensions that need to be exploited to make significant strides in software design, since software is still hard to build and hard to use even after the litany of technical ‘revolutions’ that have come and gone over the years.

Posted at 25 Jul @ 10:42 AM by user Chas Emerick | comment 0 comments
blog entry  2005/07/30

A brief word on the response to last week’s missive regarding ‘functional’ and ‘transparent’ API’s.

After my last post, we received two serious inquiries from existing customers about whether (and when) PDFTextStream will provide a transparent PDF API. The general message of these inquiries is that, even though specific applications that are using PDFTextStream require and benefit from its decidedly functional API (focussed on content extraction), it’s occasionally useful or necessary to dig a little deeper into a PDF for other reasons.

Specifically, these customers have been using PDFBox or PJ in order to look at the guts of PDF documents in a way that PDFTextStream doesn’t currently provide for. These usages aren’t in production environments — in both cases, the transparent API’s are being used in a support or troubleshooting role, especially with poorly-formed PDF documents.

Nevertheless, it’s always better to provide a complete solution, as these two customers have pointed out. Ergo, it looks like we’ll be providing a transparent API ‘some time soon’, and hopefully we’ll be able to navigate the technical difficulties likely to crop up when deploying transparent and functional API’s simultaneously. More as it happens, etc.

Posted at 30 Jul @ 10:44 AM by user Chas Emerick | comment 0 comments
Founder, Snowtide Informatics

About Me

I'm the founder of Snowtide Informatics; we make PDFTextStream, a PDF text extraction library for Java and .NET that a lot of people like and use. I do a lot of programming in Clojure and just a little in Java, trying to make it easier for people to access data from unstructured content.
