For my three regular blog readers: much of the content I've posted here is migrating over to cemerick.com, and that's where I'm posting all of my new content. So, you'll be well off to head over there, and update your feed readers to http://cemerick.com/feed/.
EDIT: FYI, this move should not be taken as an indication of anything vis-à-vis Snowtide or DocuHarvest.
Getting valuable data out of documents should not require an I.T. staff, outside consultants, building or buying software, or an up-front investment of hundreds or thousands of dollars, regardless of how many documents and how much data is involved.
This may seem strange to hear coming from me: you may or may not know that I've been principally involved in selling PDF content extraction software for the past six years. Over that time, I've had the opportunity to come face-to-face with hundreds of content and data extraction challenges across dozens of industries. If there's one takeaway I can offer up from that experience, it's this:
No one cares about the process of data extraction: people only care about their data.
Seems simple enough, but ask anyone who's been involved in any kind of data integration project, or tried to help a nontechnical user get useful data out of a directory full of documents, and you'll learn that people are forced to care. The situation is worse for e.g. small business owners and others who simply can't afford additional software and the attendant consulting hours.
DocuHarvest is an alternative path: a web application that provides data extraction and content conversion services through the browser, usable by everyone, costing pennies per document processed. There's even a free option, if you're willing to process only one document at a time.
Available Now
We're starting small, offering three types of document processing jobs:
DocuHarvest currently only accepts PDF documents as input, but that will change relatively soon – PDF just happens to be where we "come from", so we're rolling that out first. Support for additional file formats will come.
We also have a variety of additional job types in the pipeline, including:
conversion of documents to images (rasterization),
extraction of embedded images, and
thumbnail generation
That's hardly a comprehensive list. This is just the beginning; we have a lot of tricks we've saved up over the course of those six years. :-)
If you have any feedback, comments, questions, suggestions, or complaints, don't hesitate to contact me; leave a comment below or in the feedback boxes on the DocuHarvest site, message me (@cemerick) or @docuharvest on Twitter, or email me directly.
If you have to pick, choose function over form (at least when it comes to build tools).
Ahem. Sorry, let's start from the beginning.
Like any group of super-smart programmers using a relatively new language, a lot of folks in the Clojure community have looked at existing build tools (the JVM space is the relevant one here, meaning primarily Maven and Ant, although someone will bark if I don't mention Gradle, too), and felt a rush of disdain. I'd speculate that this came mostly because of XML allergies, but perhaps also in part because when one has a hammer as glorious as Clojure, it's hard to not want to use it to beat away at every problem in sight. Ruby has rake, and Python has easy_install, so it seems natural that Clojure should have its own build system that leverages the language's stellar capabilities – "just think of how simple builds could be given macros and such", one might think.
I can sympathize with that perspective, and I admit that I, too, once thought that a Clojure-based build system was an obvious move. This notion runs off the rails pretty quickly for one reason:
You can either help reimplement all of these things – or, if you're lucky enough to have access to a build tool that has a community that has built all these things already, you can use that.
Handily enough, Clojure is a JVM language, so using all of the goodness that's been built up over the years in Maven-land is extraordinarily easy to do. This means you have to write less code, and you get to use more mature, well-tested, well-supported code and tools, allowing you to focus on building awesome Clojure apps, not dicking around with implementing shell invocation, or Java compilation, or deployment via scp, or whatever "simple" build task you need today that's been in Maven's quiver for 5 years.
As if that weren't enough, Sonatype has its Polyglot Maven project, where they are working on making it possible to drive Maven builds from your favorite language, be it Clojure, Ruby, Groovy, or Scala. For now, I stick to using XML POM files (they're incredibly well-supported by tons of JVM-land tools – code completion on dependency version numbers FTW); while I love s-expressions, I'm all too happy to trade off a pinch of syntactic elegance in exchange for tons more capability.
If you're going to use Maven for your Clojure builds, here are some links:
Please make sure you check out the documentation on clojure-maven-plugin, which is where all of the Clojure-specific goals come from.
You'll do yourself a world of good by keeping the Maven books ready at hand (not the old one published years ago, BTW, the newer ones available online or through Lulu). Yup, there's a lot of material there. No, you don't need to know it all to become super-productive with Maven.
We're building a web service for which we aim to charge money. Further, the data being pushed around may be confidential or otherwise of a sensitive nature. We have good reasons to do everything we can to ensure that the service is secured "properly":
We don't want to have customers charged for work that is requested by a bad actor exploiting a security hole (of course, we'd issue a refund and an apology in such a case, but the impact to our business through unnecessary processing could be sizable).
We don't want our customers' data exposed; common vectors for this include sniffing, replay attacks, or simply the use of compromised credentials.
Of course, the impact on our relationship with our customers due to any security breach could be significant and devastating – to our business, our reputation, and potentially even to our customers' affairs completely outside of their use of our web service. So again, we have a lot of reasons to be highly-motivated when it comes to security.
By way of context, let's set the stage with regard to the moving pieces. The web service in question:
is built on a JVM stack (with the application itself built with Clojure, of course, using the Compojure framework)
has a user-facing, HTML browser interface as well as a "RESTful" API surface ("RESTful", as in, pretty darn close to ROA "style", so the set of URIs involved in delivering the user-facing interface vs. those delivering the REST API are nearly identical).
the user-facing interface offers standard form-based authentication, as well as OpenID authentication (which will be recommended only for more casual users and usage).
will always, always be delivered over SSL. We assume that every bit of data transferred is confidential, so cleartext is an absolute no-no.
OK, let's go find an expert
It is with this mindset that I've been digging into how to approach web service security. Note that I'm no specialist or expert in this area – I'm merely a practitioner that is usually focused on things far, far away from anything security-related. (It may not surprise you that I'm coming to appreciate that fact more and more as I learn about the "state of the art" in web service security.)
Given this, I set out a few weeks ago to see where things stand on the web service security front. Of course, that realm is just as full of cliques and posturing and strawmen and ad hominem attacks as the broader software development world is, so finding a clear path forward is not easy. First, a bit of literature review, as it were, drawn in particular from a flurry of web service security chatter a few years ago (emphasis here and there is mine; I wish I had noticed and grokked the indicated bits earlier, as I'll explain below):
I started by finding Gunnar Peterson's pair of posts where he compares "REST security" with WS-Security stuffs, where the former (especially approaches like HTTP Basic authentication over SSL) comes out sounding like a pretty bad choice:
people who say REST is simpler than SOAP with WS-Security conveniently ignore things like, oh message level security
Now if you are at all serious about putting some security mechanisms in to your REST there are some good examples [such as Amazon's implementation of an HMAC authentication scheme].
Some people in the REST community are able to see the need for message level security so this is heartening somewhat. If the data is distributed and the security model is point to point (at best), we have a problem.
In summary, RESTful security, that is SSL and HTTP Basic/Digest, provides a stable and mature solution that addresses transport level credential passing, encryption, and integrity. It is ubiquitous, simple, and interoperable. It requires no out-of-band contract negotiation or a priori knowledge of how the resource (okay, service) is secured. It leverages your existing security infrastructure and expertise. And it addresses 99% of the use cases you are likely to encounter. SSL does not support message level security, and if that’s a requirement, then leveraging SOAP and WSS makes sense.
I am no way suggesting there is only way to do this or that WS-Security came down on stone tablets. I am also not suggesting that a NSA level of security is appropriate for Google Maps. There are many shades of gray. “good enough” security is a big challenge, and it isnt about black and white security models, it is about risk management
From Bill de hÓra:
I think this is where quantative analysis comes in and a measured assessement of the risk is taken. What has to be protected and what’s the worthwhile cost of doing so? Being software people, that’s beyond the general state of the art. We do gut feelings, flames and opinions.
There are plenty of "REST security 'best practices'" posts out there, but a question on StackOverflow links to a variety of additional discussions that are as good an indication as any that the accepted way of securing REST web services is Basic auth over SSL.
And now for a bit of hyperbole
Before moving on, I just want to point out that Bill de hÓra's comment above is sadly representative of so many corners of software development. Let's ponder that for a moment, while realizing that modern society and its continuation absolutely depends upon the software we build (I'm talking collectively, here).
Take a deep breath
Of course, the above is not an exhaustive survey, just the best tidbits I found over the course of a lot of browsing and searching. Here's the upshot, as I see it:
WS-Security et al. ostensibly provide message-level security that ensures that your service can be passed along by untrusted intermediaries.
Standard HTTP authentication (generally Basic) over SSL transport is the de facto standard for securing REST services, but it does nothing for you if message security is important.
More sophisticated authentication mechanisms are available – in particular HMAC, as exemplified by Amazon's web services – which allow services to ensure that a message's author has not been impersonated. This would resolve the potential holes of .
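To make the HMAC idea a little more concrete, here is a minimal sketch of request signing using the openssl command-line tool. The string-to-sign layout, the function names, and the sample secret, path, and date are all made up for illustration; this is not Amazon's actual scheme. The point is only that the client and the service share a secret that never travels over the wire, and the service recomputes the signature to check that the request came from someone who holds it.

```shell
#!/bin/sh
# Sketch of HMAC-SHA1 request signing. The string-to-sign layout and all
# names below are hypothetical, not any real service's scheme.

# hmac_sha1_hex KEY MESSAGE -> hex digest of HMAC-SHA1(KEY, MESSAGE)
hmac_sha1_hex() {
    # openssl prefixes the digest differently across versions; the hex
    # digest is always the last field of its output.
    printf '%s' "$2" | openssl dgst -sha1 -hmac "$1" | awk '{print $NF}'
}

# sign_request SECRET METHOD PATH DATE -> signature over the request parts
sign_request() {
    hmac_sha1_hex "$1" "$(printf '%s\n%s\n%s' "$2" "$3" "$4")"
}

# Example values only; the client would send this signature in a header.
sign_request "s3cr3t" "POST" "/jobs" "Tue, 09 Feb 2010 15:00:00 GMT"
```

Changing any signed part of the request (or the secret) changes the signature, which is what lets the receiving side detect tampering or impersonation without relying on the transport.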
Unfortunately, I didn't grok the whole message vs. transport security issue as quickly as I should have, where SSL provides the latter but the former would only be satisfied by something like WS-Security (again, ostensibly, I certainly can't vouch for it) or HMAC-SHA1 if one were working in a REST environment. If I had come to grips with that point of tension earlier, I would have arrived at my two conclusions much faster:
In our situation, message security is simply not relevant. As Peterson wrote (and I quoted above) "If the data is distributed and the security model is point to point (at best), [REST has] a problem." Well, in our case, data is not distributed, it is transmitted point-to-point (between our customers and us, a third-party external web service), so transport security provided by SSL should be sufficient.
Here's the biggie: assuming we support form-based authentication (of course, over SSL) for browser-based UI interaction, supporting anything more sophisticated than HTTP Basic authentication over SSL for our REST API interactions would be a waste of resources. We could go full-tilt and require HMAC-SHA1 for the REST API or provide only a SOAP API that used WS-Security (and whatever else goes into that), but that would mean nothing if an attacker has the "REST API" provided for browser use available to him. Given this, transport security provided by SSL, and that alone, is simply all we can do. Put another way: when browser-level security mechanisms improve, then so will our APIs'.
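For anyone who hasn't looked at Basic authentication up close, this small sketch shows why it leans entirely on the transport: the credentials are merely base64-encoded, not encrypted, so anyone who can read the request can read the password. The function name is hypothetical, and the username and password are the well-known example pair from the HTTP Basic authentication RFC.

```shell
#!/bin/sh
# basic_auth_header USER PASSWORD -> the Authorization header a client sends.
# (Hypothetical helper name; the encoding itself is standard HTTP Basic.)
basic_auth_header() {
    # base64 is an encoding, not encryption; tr strips any line wrapping.
    printf 'Authorization: Basic %s\n' \
        "$(printf '%s:%s' "$1" "$2" | base64 | tr -d '\n')"
}

basic_auth_header "Aladdin" "open sesame"
# prints: Authorization: Basic QWxhZGRpbjpvcGVuIHNlc2FtZQ==
```

Without SSL underneath, that header is as good as cleartext, which is why the two are always mentioned together.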
An alternative path would be to host a parallel service, available via a REST API secured via HMAC-SHA1 or a WS-Security-enabled SOAP API, that did not provide any kind of browser-capable entry point. Customers could opt into this if they thought the tradeoff was important. Doing this would be technically trivial (or, perhaps only moderately difficult w.r.t. the SOAP option), but I've no idea whether the additional degree of security provided by such a parallel service would be of any interest to anyone.
By the way, if I'm totally blowing this, and my conclusions are completely broken, do speak up.
Coming soon: Part II of my investigation/thinking on the subject of web service security, related to OpenID and the management of credentials in general...which should give me all sorts of new opportunities to say foolish things!
I find myself slipping back into web development in the new year. I've known this was coming for some time, so I've had a fair chance to carefully choose my weapons:
Jetty (during development, anyway; we'll see about production)
What has really tied this all together is Maven (and a couple of plugins for it), which has enabled me to fill in a couple of gaps in what is otherwise the most pleasant web development environment I've ever used (where Pylons was the prior champ, FWIW).
The biggest gap is in automatic application reloading/redeployment – in concrete terms, when I save a Clojure source file, my application should be reloaded nearly immediately, thereby avoiding any code-build-deploy cycle. To be precise, this capability is built into Jetty (as it is in many other Java-based app servers). The question is, how to most readily utilize it.
I came across this post by Jim Downing, which describes how to set up a Maven project for a Compojure application, enabling development-mode app reloading using the maven-jetty-plugin (the formatting on that post appears to have degraded since it was published; you can check out the project described in the post here). This certainly appears to fit the bill; unfortunately, the setup that Jim describes there doesn't quite work for me – when I save a source file, the application is automatically redeployed, but no changes are picked up.
Thankfully, the fix is easy. Below is the relevant section of my pom.xml, configuring maven-jetty-plugin to add my Clojure source root as an extra classpath element. This allows Clojure, running in the jetty application server, to find and load any Clojure source files that are newer than their AOT-compiled counterparts in the usual target/classes directory (note the webAppConfig/extraClasspath elements):
With that, I'm just a mvn jetty:run away (or, really, a single click away in NetBeans) from having a development process identical to paster serve --reload, with the added benefit of Clojurey goodness.
If you want to compile Clojure code (and really, if you're involved in a project of any size or importance, you should be, if only to avoid forcing Clojure to generate bytecode at runtime, which will slow down the sort of rapid development enabled by automatic app redeployment as described above), do me a favor and use clojure-maven-plugin. (The post I reference above manually invokes the Clojure compiler using ant's exec task, but that was what you had to do back in July 2009.) It's a great piece of kit, and additionally serves as a perfect gateway drug to Maven – which, despite the controversy, and my own quibbles with various aspects of it, will eventually save your bacon in any larger project.
Over the past month, I've been gradually porting all of our projects' builds from Ant to Maven. Everything's gone swimmingly, especially given the excellent clojure-maven-plugin, which allowed me to cleave off all of our comparatively complicated ant scripts for building and testing Clojure code. One part that did require some work was the porting of the builds associated with our NetBeans Platform-based applications – so, I thought I'd post a couple of hints to help others over the rough spots.
A plug for NetBeans
We've had a good deal of success in using the NetBeans Platform recently (often referred to as the NB RCP). It provides a metric ton of fairly high-quality plumbing for thick-client applications, and definitely saved our asses in a couple of key areas insofar as we've been able to reuse large pieces of the Platform, essentially unchanged, to meet critical new requirements. Of course, that's why we chose to use it in the first place.
Extemporaneous and Lengthy Background
To be clear, the rough spots in question aren't associated with the actual Mavenization of the NetBeans Platform-based projects – that's a relatively straightforward affair, with archetypes available in the NetBeans IDE to get one started, and very well-documented goals available, all provided by the NBM Maven Plugin. Given an existing ant-based build process, I found the actual porting of the build fairly straightforward.
The dicey part had to do with having a set of Platform artifacts available to build against. Under the ant-based build regime, it was common for those building on top of the NB RCP to keep a set of RCP artifacts available in every build environment. This was always a pain (for potentially-obvious reasons that I don't really want to get into now), and the general non-composability of the ant-based build process drove NB RCP users (and the Platform developers themselves) to extreme lengths of hacking to get stuff working properly. (BTW, just so everyone knows, I'm not picking on Fabrizio here – he's just the one who appears to have pushed the envelope more than anyone else vis-à-vis improving the composability of the ant-based RCP build process.)
One great thing about the NBM Maven Plugin is that it cuts this knot quite elegantly, making it possible to treat NetBeans Modules (NBMs) as first-class citizens within the maven world. So, if you have a maven repository that contains NBMs (like this one hosted by the NetBeans folks themselves), you can readily add NBM dependencies just like you would jar dependencies from maven central:
...and the NBM plugin will take care of using those NBM dependencies as appropriate:
injecting the NBMs' associated jars into the project's compile classpath
adding the NBMs as runtime dependencies of whatever NBM(s) your project/application produces
adding the NBMs to the (optional) "update site" associated with your NB RCP application (making remote updating of that application in the field trivial)
And, to complete the cycle, the nbm-maven-plugin provides an nbm packaging type, so that you can build NBMs independently, deploy them as you'd expect, and then compose them without any ceremony into however many NB RCP applications you'd like. No suite-chaining, no special platform or cluster artifacts in every build environment, nothing at all different from what one is used to in any other jvm/maven environment.
The Rough Spot
All of the above works flawlessly (at least it has for me in my ~month of usage). The key prerequisite though, is having access to a repository that contains the Platform NBMs that you'd like to use. The repository that I linked to above does not track NetBeans releases in lockstep (e.g. at the time of this posting, the http://bits.netbeans.org/maven2 repo has NBMs from NetBeans v6.5 and v6.7, but not v6.7.1, or the recently-released v6.8). The solution is to populate your own maven repository with those NBM artifacts.
Deploying NetBeans Platform artifacts to your own repository
This might have been a tedious process, were it not for another handy goal from the NBM Maven Plugin, populate-repository, which will push all of the artifacts produced by a NetBeans Platform build (the NBMs themselves, their sources, javadoc, and appropriate non-NetBeans dependency metadata) into your own maven repository.
There's a fair bit of configuration and setup that goes into this though. A HOWTO is provided by the nbm-maven-plugin project, but there are a number of things that it leaves unspoken. So, here's a dump of what I did to successfully populate a Nexus maven repo with a full set of NetBeans Platform artifacts:
Pull the NetBeans Platform sources from the associated hg repo (I used the release68 repo, as we're targeting v6.8 of the NB RCP now). It appears that populating your repo with NB RCP artifacts from a binary download is possible, but then you'll not have the associated javadoc, source artifacts, etc.
Build the entire project – I'm sure it's possible to restrict the build to certain clusters, but I don't see any reason to optimize this process since doing so only saves a little bit of disk.
You must set your JAVA_HOME environment variable to point to a Sun JDK, especially in linux environments that often come with non-Sun JDKs (I'm looking at you, Ubuntu, with your cute gcj JDK). Not doing this will result in very strange compilation errors.
You must set your ANT_OPTS environment variable to specify a higher-than-default maximum heap (export ANT_OPTS=-Xmx1024m worked for me).
Within the top-level of your NetBeans Platform source checkout, run ant; ant nbms build-source-zips build-javadoc – this will build everything you care about in order to populate your maven repo.
You want the NBMs in your repository to have appropriate dependency relationships established with third-party artifacts, right? Achieving this is easy if you have Nexus:
unzip sonatype-work/nexus/storage/central/.index/nexus-maven-repository-index.zip somewhere (I used /tmp/nexus-index).
set the nexusIndexDirectory property in the last step to the path where you unzipped central's index; the nbm-maven-plugin will search that Lucene index to find dependencies referred to within the Platform's NBMs
set MAVEN_OPTS to specify a higher-than-default maximum heap (export MAVEN_OPTS=-Xmx512m worked for me). I'm not sure why this would be required, but I got OutOfMemoryErrors with max heap set to anything less than 512MB. Perhaps searching the maven central repo index is what pushed allocation so high.
Decide on a version number for the deployed artifacts, and use it as the value of the forcedVersion property. I used RELEASE68 to go along with the pattern established at http://bits.netbeans.org/maven2; 6.8 makes more sense to me, but if/when the NetBeans maven repo comes up to date with the NetBeans release schedule, sticking with their convention will allow us to use that authoritative repository with no changes to our projects.
Assuming you're deploying to a release repository, make absolutely sure that you've (temporarily) enabled redeployment for that repository! nbm-maven-plugin deploys some NBMs multiple times (presumably while traversing various dependency graphs), and not enabling redeployment will result in errors (400 errors from Nexus, specifically – I can't say what might happen with different repository managers).
Now for the big finish: mvn org.codehaus.mojo:nbm-maven-plugin:3.1:populate-repository -DforcedVersion=RELEASE68 -DnetbeansInstallDirectory=nbbuild/netbeans -DnetbeansSourcesDirectory=nbbuild/build/source-zips -DnexusIndexDirectory=/tmp/nexus-index -DnetbeansJavadocDirectory=nbbuild/build/javadoc -DnetbeansNbmDirectory=nbbuild/nbms -DdeployUrl=<nexus_repo_url> -DskipLocalInstall=true
Whew! Let that sucker run for a while, and you should be left with a maven repository fully populated with NetBeans Platform artifacts.
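For reference, the steps above can be strung together into one script along these lines. This is only a sketch: the checkout location, the Nexus work directory, the deploy URL, and the run/populate_repository helpers are placeholders of my own naming, and by default it only prints the commands (set DRY_RUN=0 to actually execute them). The ant targets and the populate-repository properties are the ones given in the steps.

```shell
#!/bin/sh
# Sketch only: NB_SRC, NEXUS_WORK, and DEPLOY_URL are placeholder values.
NB_SRC="${NB_SRC:-$HOME/netbeans-release68}"           # hypothetical hg checkout
NEXUS_WORK="${NEXUS_WORK:-$HOME/sonatype-work/nexus}"  # hypothetical Nexus work dir
INDEX_DIR="${INDEX_DIR:-/tmp/nexus-index}"
DEPLOY_URL="${DEPLOY_URL:-http://nexus.example.com/repo}"  # placeholder URL
DRY_RUN="${DRY_RUN:-1}"                                # 1 = print commands only

# Print a command; execute it only when DRY_RUN=0.
run() {
    echo "+ $*"
    if [ "$DRY_RUN" = "0" ]; then "$@"; fi
}

populate_repository() {
    # JAVA_HOME is assumed to point at a Sun JDK already.
    export ANT_OPTS=-Xmx1024m    # larger heap for the Platform build
    export MAVEN_OPTS=-Xmx512m   # larger heap for populate-repository

    run cd "$NB_SRC" || return 1
    run ant || return 1
    run ant nbms build-source-zips build-javadoc || return 1
    # Unpack central's index so the plugin can resolve third-party deps.
    run unzip -o "$NEXUS_WORK/storage/central/.index/nexus-maven-repository-index.zip" \
        -d "$INDEX_DIR" || return 1
    run mvn org.codehaus.mojo:nbm-maven-plugin:3.1:populate-repository \
        -DforcedVersion=RELEASE68 \
        -DnetbeansInstallDirectory=nbbuild/netbeans \
        -DnetbeansSourcesDirectory=nbbuild/build/source-zips \
        -DnexusIndexDirectory="$INDEX_DIR" \
        -DnetbeansJavadocDirectory=nbbuild/build/javadoc \
        -DnetbeansNbmDirectory=nbbuild/nbms \
        -DdeployUrl="$DEPLOY_URL" \
        -DskipLocalInstall=true
}

populate_repository
```

Remember that redeployment still has to be enabled on the target repository before running this for real; nothing in the script takes care of that.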
Talk to anyone outside of the software world, and you'll quickly realize that one of the most gut-wrenching, anxiety-inducing acts is buying software. Even if one has evaluated the product in question top to bottom, past experience of bugs, botched updates, missing features, and outright failures and crashes has tempered any enthusiasm or confidence that might be felt when the time comes to pull out the credit card or write the purchase order.
Of course, the blame for this lies squarely with the software industry itself – the failures in software quality are well known, both discrete instances as well as in aggregate. Those of us whose business and livelihood are tied to the sale of software (whether sent out the door or delivered as a service) must do whatever we can to reverse this zeitgeist.
Given that, we've decided to adopt a very simple, no-nonsense "Satisfaction Guaranteed" policy for PDFTextStream. Hopefully this will help take the anxiety out of someone's day, somewhere.
This isn't a new idea, of course. Lots of software companies have had guarantees of some sort or another for ages, but I think my first encounter with the concept as a business owner was Joel Spolsky's post from a couple of years ago:
I think that our customers are nice because they’re not worried. They’re not worried because we have a ridiculously liberal return policy: “We don’t want your money if you’re not amazingly happy.”
Joel raised the issue again on a recent StackOverflow podcast, which prompted me to think about our own approach...
What do we do about unhappy customers?
To be honest, our customers are pretty happy. Of course, we occasionally receive a bug report, but we generally knock out patches within a couple of days, and sometimes faster. In the 5 years we've been selling PDFTextStream, we've never had a single request for a refund. Part of that is offering up a very liberal evaluation version, but I'd like to think it's because what we sell does the job it's meant to do very well.
Given that, I've never thought to make a big stink about a refund policy – it just never came up. But hearing Joel and Jeff talk about the ire that they felt towards various companies that refused to issue refunds when they weren't happy with something motivated me to make our de facto policy explicit. Thus, the new "Satisfaction Guaranteed" statement.
Part II: the Open Source Influence
An elephant in the room is the influence of open source software on customers' attitudes towards buying software, and the assessment of risk that goes along with it. As more and more users of technology (just to spread the net as widely as possible) are exposed and become accustomed to the value associated with open source software (which, in simple terms, is generally high because of its zero or near-zero price), it increases pressure on commercial vendors (like us) to up our game along the same vector.
But, the impact of open source software on pricing is a pretty stale story. The real impact is derivative, in that a zero or near-zero price means that the apparent risk associated with using open source software is zero or near-zero. The promise of proprietary, commercial software is that, if it does what the vendor claims (whatever that is), then that software will deliver benefits far in excess of its cost and far in excess of the aggregate benefit provided by the open source alternatives, even given the price differential.
The problem is that a lot of people only turn towards commercial options as a last resort because of the aforementioned historical failures of the software industry vis-à-vis quality: the apparent risk of commercial options is higher than that associated with open source options, simply because the latter's super-low price is a psychological antidote to any anxiety about quality issues. So, there's flight towards low-priced options, rather than a thorough search for optimal solutions. Injecting an explicit guarantee of performance and reliability (like our new "Satisfaction Guarantee") might be enough to tip the relative apparent risk in favor of the commercial option – or, at the very least, minimize the imbalance so that it's more likely that price won't dominate other factors (which are potentially more relevant to overall benefits).
Of course, this can only work if one's product is actually better than the open source alternatives, and by a good stretch to boot so as to compensate for the price differential. In any case, it's a win-win for the formerly-anxious software user and buyer: they should feel like they have more choice overall, and therefore have a better chance of discovering and adopting the best solution for any given problem, regardless of software licenses and distribution models.
Git submodules are a relatively decent way to compose multiple source trees together, but they definitely fall short in a number of areas (which others have discussed at length elsewhere). One thing that immediately irritated me was that there is no way to recursively update, commit, push, etc., across all of one's project's submodules. This is something I ran into immediately upon moving to git from svn some months back, and it almost scared me away from git (we used a lot of svn:externals, and now a lot of git submodules).
Thankfully, the raw materials are there in git to work around this. (I've since noticed a bunch of other attempts to do similar things, but they all seem way more complicated than my approach...maybe it's the perl? ;-))
Here's the script we use for operating over git submodules recursively:
git-submodule-recur.sh
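#!/bin/sh
# Run a git command here, then recursively in every submodule.
# "init" is shorthand for "submodule update --init".
case "$1" in
"init") set -- submodule update --init ;;
esac
# "$@" preserves arguments that contain spaces, such as a commit message.
git "$@"
git submodule foreach "$0" "$@"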
Throw that into your $PATH (I trim the .sh), chmod +x, and git submodules become pretty pleasant to work with. All this is doing is applying whatever arguments you would otherwise provide to git within each submodule, and their submodules, etc., all the way down. The one special invocation, git-submodule-recur init, just executes git submodule update --init in all submodules.
So, want to get the status of your current working directory, and all submodules?
git-submodule-recur status
Want to commit all modifications in cwd and all submodules?
git-submodule-recur commit -a -m "some comment"
Want to push all commits?
git-submodule-recur push
You get the picture.
Note: Starting in git 1.6.5, git submodule will grow a --recursive option for the foreach, update and status commands. That's very helpful for the most common actions (and critical for building projects that have submodules in CI containers like hudson), but git-submodule-recur definitely still has a place IMO, especially for pushing.
This script has saved me a *ton* of typing over the past months. Hopefully, it finds a good home elsewhere, too.
Edited 2009/09/28: I tweaked the git-submodule-recur script to quote the path to the script ("$0" instead of $0); this became necessary when I dropped the script into C:\Program Files\Git\bin in our Windows-hosted Hudson environment.
Consider: up until last week, I was simply using this space every now and then for some relatively bland navel-gazing related to selected goings-on at Snowtide. Then, a friend of mine decided to put my most recent post (probably the only potentially inflammatory post I’ve ever written) on reddit, and a variety of people weren’t very happy (in comments to the post itself, on reddit’s comment page, and to a lesser extent on a Joel On Software thread). For someone who can lay only a tenuous claim to being a blogger (never mind the title of A-, B-, C-, or D-list blogger!), it’s been an interesting experience to say the least.
I tried to participate in the discussions that were swirling around, but eventually the comments became too numerous for me to follow in a timely way given the amount of bandwidth I’ve allocated to such things. So, I’m taking the easy/cheap way out with a response post. I know this is frowned upon by many, but c’est la vie. Here, I will respond in two parts:
In reading over all of the commentary, there seem to be three types of responses:
Response Type A: Any lack of growth/”innovation” or a slowing of such growth in Python is good — stability makes it easier to concentrate on customer solutions, and encourages robust library development.
Regardless of your language or platform, if stability and operational continuity are overriding interests of yours, then lock yourself into a particular build, and stay there as long as you want. This is a significant part of the job of IT departments in large organizations – to standardize on environments and tools so as to shield the organization from unwanted change and cost.
(As an aside, Ruby’s Matz provides a positive spin on the “Python is stable, and that’s good” attitude, which may or may not be cheeky [it's hard to tell through the translation]: “Perhaps Python has a sense of responsibility.”)
Response Type B: The “significant improvements” I’d like to see in Python are (take your pick): of academic use only; overhyped genius toys that only make it easier to build overly complex solutions; or distractions from other improvements that would be immediately useful to the majority of the Python userbase.
This attitude pops up frequently in any discussion of programming paradigms that are off the beaten track, any technique that is unfamiliar to the commenter, or anything that the commenter has had problems with in the past. Meek typified this kind of response with:
Python is not growing because you want programmable syntax and “esoteric” features? Features that 99% of software developers should never use. Let me guess, you have never maintained a project written in a language that supports programmable syntax where geniuses abuse meta-programming where simpler alternatives achieve the same goal.
This is a particularly disturbing line of thought, and one that I had always considered to be antithetical to a central principle of Python (at least in my eyes), that the programmer should always be trusted. I’ve always associated this with a variety of Python features, including duck typing, the lack of access controls around class members (modulo the slightly perverse double-underscore notation and associated name mangling of “private” attributes), the composability of namespaces, etc.
Lots of programming features are “esoteric”, depending on who you ask. Pointer access is esoteric to a web app developer and should never be used in such a context, but it’s critical to a C-language device driver programmer. Any number of language features can simultaneously be considered esoteric by some and necessary by others. Not recognizing this, and then implying that “simpler alternatives” could readily take the place of those “genius” toys is evidence of a lack of perspective. 28/w in the JOS thread makes my point better than I ever could:
It’s precisely because I want my projects to be on time that I don’t use assembly language for everything. That’s the same reason I don’t use C++ either. I’m about 5x as productive in Ocaml as I am in C++ i.e., at least 80% of my time spent coding C++ is spent dealing with language issues; it’s the equivalent of spending time making all my function calls out of gotos.
Most likely, 80% of my time coding Ocaml is wasted too, and I just don’t know it.
Bottom line: just because you don’t see a use for a particular language feature doesn’t mean that someone else doesn’t find it absolutely, positively necessary.
Response Type C: Python is growing, and if you were to pay attention, you’d notice. We’re just not working on what you want.
This point has been made by a variety of people, but I should give special attribution to Phillip J. Eby, since he’s a significant Python contributor:
Um, so you don’t think the “with” statement and coroutines were new features?
What about the new metaclass hook that’ll be in Python 3.0 (and maybe 2.6)? It’s actually a pretty significant step forward for implementing Ruby-like DSL’s in Python.
I suppose this is the nut of the problem, at least as far as this discussion has related specifically to the technical aspects of Python: I’m not bowled over by the improvements Phillip cites. They’re very useful and handy to the vast majority of Python programmers, but they’re not game-changers (which I suppose is what I meant by “significant growth”). I think the description of the metaclass hook as “a pretty significant step forward for implementing Ruby-like DSL’s in Python” is very telling. The facilities for building DSLs in Ruby are good insofar as they make it possible to get the job done, but they’re by no means conceptually complete nor functionally clean (as pointed out by jerf in the reddit comments), so taking a “significant step” towards implementing such facilities isn’t the whole ballgame.
Regardless of that detail, the point is that progress is being made in Python — just not in the vector I need. And, that’s OK. Which brings me to…
Sandbox Etiquette
After all has been said and done, my original post was a mistake, in that I exhibited a similar type and degree of technological selfishness as those who replied with Type A responses. As some of my friends will attest, I’ve personally been unhappy with Python and its direction for a variety of reasons for months now, especially as I’ve sunk further and further into a class of problems for which Python isn’t particularly well-suited at the moment. While I had settled on that conclusion some time ago, I’ve obviously been suffering from a mental block that caused me to do drive-bys against Python. This came to a head with my blog post.
The more mature (and zen) thing to do would have been to simply go looking for a different sandbox, and leave well enough alone with regard to Python. (It is, after all, a fantastic language and will likely remain my favorite for most common tasks [especially web programming] for some time hence). This is especially true given the fact that I am essentially a nobody in the Python community – I’ve contributed in my own small ways, but it’s not like I’m a core hacker or important library author. Instead, I adopted the Response Type A attitude, but flipped it on its head, claiming that my favorite language should advance itself to suit my requirements, and to hell with the priorities of others.
So, let’s make a deal: I’ll stop sniping on Python, and maybe everyone else can stop making clever comments about “esoteric” language features. Then we can all spend more time building bigger and better sandcastles.
I'm the founder of Snowtide Informatics. We make DocuHarvest, a web application that turns your valuable documents into data, and PDFTextStream, a PDF text extraction library for Java and .NET. I do a lot of programming in Clojure and just a little in Java, trying to make it easier for people to make unstructured content just a little more useful.