Currently I am using PDFBox to parse PDF documents, but some PDF files cannot be read or parsed. Is there any other PDF parser I could use?
Hi Paul, thanks.
But it seems that iText cannot be used to parse the content of a PDF file. It can only be used for reading the content. I read the tutorial and have pasted the relevant part below:
You can't 'parse' an existing PDF file using iText, you can only 'read' it page per page.
What does this mean?
The pdf format is just a canvas where text and graphics are placed without any structure information. As such there aren't any 'iText-objects' in a PDF file. In each page there will probably be a number of 'Strings', but you can't reconstruct a phrase or a paragraph using these strings. There are probably a number of lines drawn, but you can't retrieve a Table-object based on these lines. In short: parsing the content of a PDF-file is NOT POSSIBLE with iText (not if you want good results: there are ways to retrieve text from an existing PDF).
Currently I am using PDFBox, but it can't parse certain types of PDF file.
Thanks so much,
What you want to do may not be possible.
First off, some PDF files are encrypted as a means of copy protection. As far as I know, none of the publicly available tools will decrypt and parse these files without the password (circumventing the protection could be a violation of US digital copyright laws).
Secondly, there is pretty much no way to "parse" a PDF. The best you can do is to pull out blocks of text. PDF does not have any real formatting information the way HTML does, so data retrieval is very limited.
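To give a rough idea of what "pulling out blocks of text" involves, here is a minimal sketch in plain Java. It does not use PDFBox or iText at all: the Fragment class, the groupIntoLines method and the tolerance value are all hypothetical names made up for illustration. It assumes some extractor has already handed you strings with page coordinates, and it guesses which strings belong to the same line by comparing their vertical positions. Real documents (columns, rotated text, tables) will defeat a heuristic this simple.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class LineGrouper {

    // Hypothetical type: one string as placed on the page canvas.
    static final class Fragment {
        final String text;
        final double x;
        final double y;

        Fragment(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    // Groups fragments whose y positions differ by at most yTolerance
    // into one line, then orders each line left to right.
    static List<String> groupIntoLines(List<Fragment> fragments, double yTolerance) {
        List<Fragment> sorted = new ArrayList<>(fragments);
        // Assumption: y grows upward (as in PDF user space), so the
        // largest y is the top of the page and should come first.
        sorted.sort((a, b) -> Double.compare(b.y, a.y));

        List<List<Fragment>> lines = new ArrayList<>();
        for (Fragment f : sorted) {
            List<Fragment> last = lines.isEmpty() ? null : lines.get(lines.size() - 1);
            if (last != null && Math.abs(last.get(0).y - f.y) <= yTolerance) {
                last.add(f);              // close enough: same line
            } else {
                List<Fragment> line = new ArrayList<>();
                line.add(f);              // otherwise start a new line
                lines.add(line);
            }
        }

        List<String> result = new ArrayList<>();
        for (List<Fragment> line : lines) {
            line.sort(Comparator.comparingDouble(f -> f.x));
            StringBuilder sb = new StringBuilder();
            for (Fragment f : line) {
                if (sb.length() > 0) {
                    sb.append(' ');
                }
                sb.append(f.text);
            }
            result.add(sb.toString());
        }
        return result;
    }

    public static void main(String[] args) {
        // Fragments arrive in no particular order, as they may in a PDF.
        List<Fragment> page = List.of(
                new Fragment("world", 120, 700),
                new Fragment("Second", 50, 680),
                new Fragment("Hello", 50, 700.5),
                new Fragment("line", 100, 680));
        groupIntoLines(page, 2.0).forEach(System.out::println);
    }
}
```

Even when this works, all you get back is lines of text. Nothing in the output tells you what was a heading, a paragraph or a table cell, which is the limitation described above.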
Anyway, that is pretty much the extent of my knowledge on the topic. I wish you the best of luck.