Hi all,
Currently I am using PDFBox to parse PDF documents, but some PDF files cannot be read or parsed. Is there any other PDF parser that I could use?
Thanks in advance,
Felix
PDF Parser (3 messages)
- Posted by: Felix Yunius
- Posted on: June 16 2004 03:54 EDT
Threaded Messages (3)
- PDF Parser by Paul Strack on June 16 2004 13:13 EDT
- PDF Parser by Felix Yunius on June 16 2004 20:29 EDT
- PDF Parser by Paul Strack on June 17 2004 11:44 EDT
PDF Parser
- Posted by: Paul Strack
- Posted on: June 16 2004 13:13 EDT
- in response to Felix Yunius
Try iText.
PDF Parser
- Posted by: Felix Yunius
- Posted on: June 16 2004 20:29 EDT
- in response to Paul Strack
Hi Paul, thanks.
But it seems that iText cannot be used to parse the content of a PDF file; it can only be used to read it. I read the tutorial and have pasted the relevant part below:
PdfReader
You can't 'parse' an existing PDF file using iText, you can only 'read' it page per page.
What does this mean?
The pdf format is just a canvas where text and graphics are placed without any structure information. As such there aren't any 'iText-objects' in a PDF file. In each page there will probably be a number of 'Strings', but you can't reconstruct a phrase or a paragraph using these strings. There are probably a number of lines drawn, but you can't retrieve a Table-object based on these lines. In short: parsing the content of a PDF-file is NOT POSSIBLE with iText (not if you want good results: there are ways to retrieve text from an existing PDF).
Currently I am using PDFBox, but it can't parse certain types of PDF files.
Please advise.
Thanks so much,
Felix
PDF Parser
- Posted by: Paul Strack
- Posted on: June 17 2004 11:44 EDT
- in response to Felix Yunius
What you want to do may not be possible.
First off, some PDF files are encrypted as a means of copy protection. None of the publicly available tools can decrypt and parse these files (doing so would be a violation of US digital copyright laws).
Secondly, there is pretty much no way to "parse" a PDF. The best you can do is to pull out blocks of text. PDF does not have any real formatting information the way HTML does, so data retrieval is very limited.
Anyway, that is pretty much the extent of my knowledge on the topic. I wish you the best of luck.
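To illustrate the point about only being able to pull out blocks of text, here is a minimal sketch in plain Java (no PDF library). It assumes you already have a page's content stream as uncompressed text, which is usually not the case in real files, where streams are typically compressed and fonts may use custom encodings. The class name, method name, and the sample stream are made up for illustration; this is not how PDFBox or iText work internally.

```java
import java.util.ArrayList;
import java.util.List;

public class ContentStreamStrings {

    // Collects every literal string "( ... )" in an uncompressed content stream.
    // Simplifications (assumptions of this sketch): hex strings <...> are ignored,
    // and a backslash just keeps the next character as-is (no \n or octal decoding).
    static List<String> extractLiteralStrings(String content) {
        List<String> result = new ArrayList<>();
        StringBuilder current = null; // non-null while we are inside a string
        int depth = 0;                // nesting depth of unescaped parentheses
        for (int i = 0; i < content.length(); i++) {
            char c = content.charAt(i);
            if (current == null) {
                if (c == '(') {
                    current = new StringBuilder();
                    depth = 1;
                }
                continue; // operators and numbers outside strings are skipped
            }
            if (c == '\\' && i + 1 < content.length()) {
                current.append(content.charAt(++i));
            } else if (c == '(') {
                depth++;
                current.append(c);
            } else if (c == ')') {
                depth--;
                if (depth == 0) {
                    result.add(current.toString());
                    current = null;
                } else {
                    current.append(c);
                }
            } else {
                current.append(c);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical page content: three strings placed at different positions.
        String page = "BT /F1 12 Tf 72 700 Td (Dear Felix,) Tj "
                    + "0 -14 Td (Item) Tj 200 0 Td (Total) Tj ET";
        System.out.println(extractLiteralStrings(page));
    }
}
```

What comes back is a flat list of strings in drawing order. Nothing in it says whether "Item" and "Total" were table headers, a sentence, or two unrelated labels, which is the limitation described above.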