Overview

This repository contains files for building classifiers for file type detection. Initially the project was built for developing Deep Neural Networks in order to spit out neural network parameters for Tika to learn.

However, I have built over other classifiers for performing the same functionality. This repository now supports almost 6 classifiers and preprocessors for creating your on dataset.

I highly encourage contributors to develop their own dataset and try out the different classifiers to update the result.

Current Status

We are working on scaffolding in the form of a Flask App that employs a decision tree to classify files and that uses the built model files from the library.

Dependencies

  • Pandas
  • Numpy
  • Theano (for leveraging GPU and building deeper netwoks)
  • Sklearn

Classifiers Supported

  • Decision Tree
  • Neural Network
  • Gaussian NB
  • SVM
  • Random Forest Classifier
  • K-Nearest Neighbor Classifier
  • Gradient Boosting Classifier

Neural Network Results

Mime type Test Accuracy Number of Hidden Layers
application/x-grib 92.34% 2
application/x-grib 94.33% 4
application/xhtml 99.5% 2

Decision Tree Results

Mime type Test Accuracy
application/x-grib 99.76%

Support Vector Machine

Mime type Test Accuracy
application/x-grib 90.85%

Gaussian Naive Bayes

Mime type Test Accuracy
application/x-grib 90.30%

Random Forest Classifier

Mime type Test Accuracy
application/x-grib 99.94%

KNN Classifier

Mime type Test Accuracy
application/x-grib 99.54%

Stochastic Gradient Descent

Mime type Test Accuracy
application/x-grib 98.99

Gradient Boosting Classifier

Mime type Test Accuracy
application/x-grib 99.91

Running the project

Each classifier in the classifier package can be used to train your model. Classifiers follow a simple structure which involves three steps: - build the model - train the classifier - test the classifier

The neural network is special and generates a nnmodel file that can be used with Apache Tika in order to train the NN to work on content based detection and not using Magic Numbers.

Understanding the input file

It is assumed that the input training files have the following format: - First 256 columns correspond to the byte frequency companded using any function you like - The last column is the output column.

Wonderng how to go ahead and create dataset like the one used? The preprocessor contains important constructs that will help generate the dataset needed.