Welcome to Muck and Brass, the Snowtide blog site    

Unicode issue found (Icelandic)

Authored on Sep 10, 2004 06:52 PM by Chas Emerick ; last touched on Sep 10, 2004 06:52 PM

A bug in the current build of PDFTextStream (v1.2) can cause some Icelandic characters to be output incorrectly. The issue manifests only if both of the following are true:

  • PDFTextStream is configured with strictEncoding set to true (via PDFTextStreamOptions.setUseStrictEncoding(boolean))
  • PDFTextStream is used to extract text and metadata from a PDF containing certain Icelandic characters, including Ð (Eth), ð (eth), Þ (Thorn), and þ (thorn)
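
To check whether a given piece of text could be affected, here is a minimal sketch that looks for the four characters named above. The class and method names are hypothetical and are not part of PDFTextStream, and since the list above is introduced with "including", the set may not cover every affected character.

```java
import java.util.Set;

public class AffectedCharacters {
    // The four characters named in this post: Ð (U+00D0), ð (U+00F0),
    // Þ (U+00DE), þ (U+00FE). Assumption: other characters may be affected too.
    private static final Set<Character> AFFECTED =
            Set.of('\u00D0', '\u00F0', '\u00DE', '\u00FE');

    /** Returns true if the text contains any of the four characters above. */
    public static boolean containsAffected(String text) {
        for (int i = 0; i < text.length(); i++) {
            if (AFFECTED.contains(text.charAt(i))) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(containsAffected("\u00DEingvellir")); // prints "true"
        System.out.println(containsAffected("Reykjavik"));       // prints "false"
    }
}
```

A check like this only tells you whether the characters occur in a string you already trust (for example, a document's known title); it says nothing about whether the extracted output itself is correct.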

We have found the root cause of the problem, and a fix is in development. A bugfix release containing it will be available by the end of this week.

Update: This issue has been resolved, but not by a bugfix release — the issue originally arose because of a malformed PDF document. See this post for the gory details…

