[Dspace-tech] PDF text extraction

Eric Luhrs

2009-02-23 20:05:01 UTC

It took some digging but this issue has been resolved. I am reporting back
to this list because a few people have expressed interest.

At Larry Stone's suggestion, I verified that pdftotext (part of xpdf) was
able to extract text from my scanned PDF. I also re-OCRed the PDFs using
Acrobat 8 Pro, and found that media-filter was able to extract the text with
no problem. Realizing that the problem was with my OCR app (JRA Publish,
which is really great for creating batches of super-small PDF and DjVu files
with highly accurate OCR), I contacted the lead developer and learned that a
similar issue had been reported earlier. It turns out that PDFbox was
looking for the "ToUnicode" flag in the OCRed text, and failing when it was
not found. My copy of JRA Publish was a few years old, but the new version
included the flag that PDFbox needed to extract text from my files.
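
For anyone who wants to screen a batch of files before ingest, here is a
rough sketch of a check for the "ToUnicode" marker described above. It is
not how PDFbox or media-filter works internally; it only scans the raw bytes
of each file for font dictionaries and "/ToUnicode" keys. The function names
are my own, and the approach assumes the font dictionaries are stored
uncompressed. PDFs that pack objects into compressed object streams will
show no fonts at all, so treat a zero count as "unknown", not as "fine".

```python
import re
import sys
from pathlib import Path

# "/Type /Font" marks a font dictionary; \b keeps "/FontDescriptor" from matching.
FONT_RE = re.compile(rb"/Type\s*/Font\b")
TOUNICODE_RE = re.compile(rb"/ToUnicode\b")


def tounicode_report(data: bytes) -> dict:
    """Count visible font dictionaries and /ToUnicode keys in raw PDF bytes."""
    return {
        "fonts": len(FONT_RE.findall(data)),
        "tounicode": len(TOUNICODE_RE.findall(data)),
    }


def lacks_tounicode(data: bytes) -> bool:
    """True when fonts are visible but none of them carries a /ToUnicode key."""
    report = tounicode_report(data)
    return report["fonts"] > 0 and report["tounicode"] == 0


if __name__ == "__main__":
    # Usage (hypothetical script name): python check_tounicode.py file1.pdf file2.pdf
    for name in sys.argv[1:]:
        data = Path(name).read_bytes()
        report = tounicode_report(data)
        if report["fonts"] == 0:
            status = "unknown (no uncompressed font dictionaries found)"
        elif report["tounicode"] == 0:
            status = "no /ToUnicode found"
        else:
            status = "/ToUnicode present"
        print(f"{name}: {status}")
```

A file flagged "no /ToUnicode found" is only a candidate for a closer look.
Running pdftotext on it, as described above, is still the more direct test of
whether text can be extracted.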

Along the way, I also found a helpful document in Michigan's DeepBlue
repository that provides some best practices for scanned and born-digital
PDFs. Anyone interested in creating better PDFs should take a look here:

http://deepblue.lib.umich.edu/handle/2027.42/58005

Eric Luhrs
Lafayette College

Post by Eric Luhrs
I just created a collection of 72 PDFs, mostly from scanned image files,
but with several born digital files too. I was disappointed to learn that
PDFbox was unable to process the scanned documents even though they contain
searchable text. The files were created using a third-party OCR tool, but I
am able to copy and paste the text using Acrobat.
I understand that DSpace is limited by what PDFbox is able to process, so
my question is, are there any guidelines for PDF creation to help ensure that
PDFbox can read them? For instance, maybe it only understands certain
versions of the PDF language, or certain types of compression.
Any suggestions? I figured I'd try here before contacting the PDFbox
community.
Eric Luhrs
Lafayette College