Hello Apache Tika

Apache Tika is a handy framework for extracting the content of a file. For example, you can extract the text of a PDF, a Word document, or an Excel spreadsheet as a string. It also lets you extract metadata about the file, such as when it was created and who the author is. I built this sample application to play with Tika:
import org.apache.tika.Tika;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;

import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

/**
 * Prints the metadata and the text content of a file using Apache Tika.
 * Usage: java TikaClient <filepath>
 */
public class TikaClient {

    public static void main(String[] argv) throws Exception {
        if (argv.length != 1) {
            System.out.println("Usage: TikaClient <filepath>");
            System.exit(-1);
        }
        String fileName = argv[0];
        // Read the file once so the same bytes can back two independent streams.
        byte[] fileBytes = readByteArray(fileName);

        // First pass: AutoDetectParser detects the file type and fills in the metadata.
        Parser parser = new AutoDetectParser();
        BodyContentHandler handler = new BodyContentHandler();
        Metadata metadata = new Metadata();
        ParseContext context = new ParseContext();
        try (InputStream inputStream = new ByteArrayInputStream(fileBytes)) {
            parser.parse(inputStream, handler, metadata, context);
        }
        for (String name : metadata.names()) {
            System.out.println("Header " + name + ": " + metadata.get(name));
        }

        // Second pass: the Tika facade returns the body text as a String.
        Tika tika = new Tika();
        System.out.println("File Content ->" + tika.parseToString(new ByteArrayInputStream(fileBytes)));
    }

    // Reads the whole file into memory; Files.readAllBytes closes the file when done.
    private static byte[] readByteArray(String filePath) throws IOException {
        return Files.readAllBytes(Paths.get(filePath));
    }
}
You can try it by passing the full path of the file you want to extract as the only argument.
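The sample above parses the file twice: once through the parser to collect metadata, and once through tika.parseToString to get the text. Since the BodyContentHandler passed to the parser already collects the body text, both can probably be had from a single parse. Here is a minimal sketch of that idea. The class name SinglePassExtract, the Result holder, and the extract method are made up for illustration and are not part of Tika. It also assumes the Tika parser modules are on the classpath, and it passes -1 to BodyContentHandler to lift the default write limit, which you may not want for very large files.

```java
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;

import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical class name; not part of Tika or of the sample above.
public class SinglePassExtract {

    // Small holder for the two things a single parse produces.
    static final class Result {
        final String text;
        final Metadata metadata;

        Result(String text, Metadata metadata) {
            this.text = text;
            this.metadata = metadata;
        }
    }

    // Parses the file once and returns both the body text and the metadata.
    static Result extract(Path path) throws Exception {
        // -1 disables the handler's write limit (an assumption: fine for small files).
        BodyContentHandler handler = new BodyContentHandler(-1);
        Metadata metadata = new Metadata();
        try (InputStream in = Files.newInputStream(path)) {
            new AutoDetectParser().parse(in, handler, metadata, new ParseContext());
        }
        // The handler has accumulated the body text during the parse.
        return new Result(handler.toString(), metadata);
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("Usage: SinglePassExtract <filepath>");
            System.exit(1);
        }
        Result result = extract(Paths.get(args[0]));
        for (String name : result.metadata.names()) {
            System.out.println("Header " + name + ": " + result.metadata.get(name));
        }
        System.out.println("File Content ->" + result.text);
    }
}
```

Run it the same way as TikaClient, with the file path as the only argument. The output should look much like the original's, but the file is streamed from disk instead of being held in a byte array.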
