Discussion:
[Dspace-tech] Searching of text from PDF files
Gary Browne
2009-09-01 05:55:11 UTC
Hi all,

I have a query about searching of pdf documents which I can't seem to
find a definitive answer for:

When a user searches via the dspace web interface, is the search run
across the content of text pdfs or just the metadata? If so, does the
pdf submitted to the repository need to have been previously OCR'd, or
does the repository attempt to extract & index text from all pdfs?

Any information regarding this would be greatly appreciated.

Thanks
Gary


Gary Browne
Development Programmer
Library IT Services
University of Sydney
ph: 9351-5946
Sent from my plain old desktop computer.
Mark H. Wood
2009-09-09 20:45:37 UTC
Post by Gary Browne
When a user searches via the dspace web interface, is the search run
across the content of text pdfs or just the metadata? If so, does the
pdf submitted to the repository need to have been previously OCR'd, or
does the repository attempt to extract & index text from all pdfs?
DSpace doesn't include OCR code.

The full-text extractor (which feeds the indexing) requires actual
coded-character text in the PDF to work with. If all you have is a
bag of bitmaps (such as you often get from scanning paper documents
into PDF) then they contain nothing useful to extract; you'll need to
OCR or otherwise recover the character data before ingesting the file
into DSpace.
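
A minimal sketch of a pre-ingest check along these lines, assuming you
have already run some text extractor over the PDF yourself and have its
output as a string. The function name `needs_ocr` and the threshold are
hypothetical and are not part of DSpace; the idea is only that a scanned
"bag of bitmaps" yields little or no character data per page.

```python
def needs_ocr(extracted_text: str, page_count: int,
              min_chars_per_page: int = 20) -> bool:
    """Return True when extraction produced too little text to index.

    extracted_text: whatever an external extractor returned for the PDF.
    page_count: number of pages in the PDF (must be positive).
    min_chars_per_page: hypothetical cutoff; tune it for your collection.
    """
    if page_count <= 0:
        raise ValueError("page_count must be positive")
    # Count only letters and digits, so whitespace and stray form feeds
    # from an otherwise empty extraction do not look like content.
    useful = sum(1 for ch in extracted_text if ch.isalnum())
    return useful / page_count < min_chars_per_page
```

A file flagged by a check like this would go through OCR first, and only
then be ingested.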
--
Mark H. Wood, Lead System Programmer ***@IUPUI.Edu
Friends don't let friends publish revisable-form documents.
Vishal Kakapuri
2009-09-10 16:39:31 UTC
1) The PDF is an image: it needs to be OCR'd first, then uploaded with
its metadata. filter-media will try to extract the text from the PDF and
save it as a text file alongside the PDF. --> search happens on the
extracted text
OR
2) The PDF is text: it can be uploaded as is, with its metadata.
filter-media will try to extract the text from the PDF and save it as a
text file alongside the PDF. --> search happens on the extracted text

3) Indexing is on metadata only.
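
One reading of the points above, as a toy model rather than DSpace's
actual indexer: an item is always findable through its metadata, and it
becomes findable through its content only when an extracted-text file
exists next to the PDF. The item IDs, field names and sample strings
below are all made up for illustration.

```python
import re
from collections import defaultdict
from typing import Dict, Optional, Set


def tokenize(text: str) -> Set[str]:
    """Lower-case the text and split it into alphanumeric terms."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def build_index(items: Dict[str, Dict[str, Optional[str]]]) -> Dict[str, Set[str]]:
    """Map each term to the set of item IDs it occurs in."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for item_id, item in items.items():
        # Metadata is always indexed.
        for term in tokenize(item["metadata"] or ""):
            index[term].add(item_id)
        # Content is indexed only if text extraction produced something.
        extracted = item.get("extracted_text")
        if extracted:
            for term in tokenize(extracted):
                index[term].add(item_id)
    return index


# Hypothetical items: one text PDF, one scanned PDF that was never OCR'd.
items = {
    "text-pdf": {"metadata": "Annual report 2008",
                 "extracted_text": "Rainfall figures for Sydney"},
    "scanned-pdf": {"metadata": "Field notebook",
                    "extracted_text": None},
}
index = build_index(items)
```

In this model, a search for a word that appears only on the pages of the
scanned PDF finds nothing, while a word from its metadata still matches.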