Apache Tika - a content analysis toolkit
The Apache Tika™ toolkit detects and extracts metadata and text from over a thousand different file types (such as PPT, XLS, and PDF). All of these file types can be parsed through a single interface, making Tika useful for search engine indexing, content analysis, translation, and much more.
Tika is one of the best choices for an HTML parser. Other options include jsoup.
For simple cases, a regular expression or XPath is enough to get what you want. For more complicated cases, you need one of these toolkits.
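To make the "simple case" concrete, here is a minimal sketch that uses only the JDK, with no Tika or jsoup involved. The class name `SimpleHtmlExtract`, its two helper methods, and the sample markup are made up for illustration. The XPath helper assumes the input is well-formed XHTML with no namespaces, which real-world HTML often is not; that limitation is exactly where a toolkit becomes the better choice.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class for illustration; not part of Tika or jsoup.
final class SimpleHtmlExtract {

    // Matches the first <title>...</title> pair, ignoring case and line breaks.
    private static final Pattern TITLE =
            Pattern.compile("<title[^>]*>(.*?)</title>",
                    Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Returns the trimmed title text, or null when no <title> element is found.
    public static String titleByRegex(String html) {
        Matcher m = TITLE.matcher(html);
        return m.find() ? m.group(1).trim() : null;
    }

    // Returns every href attribute of <a> elements, in document order.
    // Assumption: the input parses as well-formed XML (XHTML without namespaces).
    public static List<String> linksByXPath(String xhtml) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList nodes = (NodeList) xpath.evaluate(
                    "//a/@href",
                    new InputSource(new StringReader(xhtml)),
                    XPathConstants.NODESET);
            List<String> hrefs = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                hrefs.add(nodes.item(i).getNodeValue());
            }
            return hrefs;
        } catch (XPathExpressionException e) {
            // Malformed markup ends up here: the point where a real parser pays off.
            throw new IllegalArgumentException("Input is not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        String page = "<html><head><title> Hello Tika </title></head>"
                + "<body><a href=\"a.html\">A</a><a href=\"b.html\">B</a></body></html>";
        System.out.println(titleByRegex(page));
        System.out.println(linksByXPath(page));
    }
}
```

The regular expression is fine for a single, predictable element, and XPath works as long as the markup is strict. Once pages have unclosed tags, nested structures, or mixed encodings, both approaches tend to break down.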