Eric Luhrs
2009-02-17 22:05:18 UTC
I just created a collection of 72 PDFs, mostly from scanned image files, but
with several born-digital files too. I was disappointed to learn that
PDFBox was unable to process the scanned documents even though they contain
searchable text. The files were created using a third-party OCR tool, and I
am able to copy and paste the text using Acrobat.
I understand that DSpace is limited by what PDFBox is able to process, so my
question is: are there any guidelines for PDF creation to help ensure that
PDFBox can read the files? For instance, maybe it only understands certain
versions of the PDF format, or certain types of compression.
Any suggestions? I figured I'd try here before contacting the PDFBox
community.
Eric Luhrs
Lafayette College