TrisZaska's Machine Learning Blog

[Intro to ML] - Playing around with data mining in Scikit-learn

1. Project Introduction

In 2000, Enron was one of the largest companies in the United States. By 2002, it had collapsed into bankruptcy due to widespread corporate fraud. In the resulting Federal investigation, there was a significant amount of typically confidential information entered into the public record, including tens of thousands of emails and detailed financial data for top executives. In this project, I will play detective, building a person of interest identifier based on financial and email data made public as a result of the Enron scandal.[1]

2. Steps to achieve the goal [2]

Task 0: Import needed libraries
Task 1: Support functions
1.1 Available support functions
  • Because the provided support functions are quite long, I host them in a GitHub repository here
1.2 My support functions
  • Similarly, the source code for my own support functions is available here
Task 2: Select which features to use
Task 3: Remove outliers
3.1 Visualize data to find outliers
3.2 Remove outliers from data
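The exact outlier handling lives in the support functions linked above; the following is only a rough sketch of the idea, assuming the Udacity final_project_dataset.pkl dictionary (keyed by person name, with missing values stored as the string "NaN"):

    import pickle
    import matplotlib.pyplot as plt

    # Assumption: the ud120 dataset, a dict keyed by person name.
    with open("final_project_dataset.pkl", "rb") as f:
        data_dict = pickle.load(f)

    # 3.1 Scatter salary vs. bonus to spot extreme points.
    points = [(v["salary"], v["bonus"]) for v in data_dict.values()
              if v["salary"] != "NaN" and v["bonus"] != "NaN"]
    plt.scatter(*zip(*points))
    plt.xlabel("salary")
    plt.ylabel("bonus")
    plt.show()

    # 3.2 The spreadsheet aggregate row "TOTAL" is not a real person,
    # so it is dropped before building features.
    data_dict.pop("TOTAL", None)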
Task 4: Prepare new features for learning model
4.1 Use cross-validation to split the data
4.2 Apply Principal Component Analysis (PCA)
4.2.1 Reduce the dimensionality of the data
4.2.2 Visualize the data after dimensionality reduction
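A minimal sketch of 4.1 and 4.2, assuming `features` and `labels` are the numeric arrays produced by the support functions (e.g. featureFormat/targetFeatureSplit); the split ratio and number of components are illustrative, not the tuned values:

    from sklearn.model_selection import train_test_split
    from sklearn.decomposition import PCA

    # 4.1 Hold out part of the data for testing.
    features_train, features_test, labels_train, labels_test = train_test_split(
        features, labels, test_size=0.3, random_state=42)

    # 4.2 Project onto 2 principal components so the classifiers
    # can later be visualized on a 2-D plane.
    pca = PCA(n_components=2)
    features_train_pca = pca.fit_transform(features_train)
    features_test_pca = pca.transform(features_test)
    print(pca.explained_variance_ratio_)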
Task 5: Try a variety of classifiers
5.1 Gaussian Naive Bayes
5.1.1 Brief Introduction
Naive Bayes[6] methods are a set of supervised learning algorithms based on applying Bayes’ theorem with the “naive” assumption of independence between every pair of features. Given a class variable \(y\) and a dependent feature vector \(x_1\) through \(x_n\), Bayes’ theorem states the following relationship:
\[ P(y \mid x_1, \dots, x_n) = \frac{P(y)\, P(x_1, \dots, x_n \mid y)}{P(x_1, \dots, x_n)} \]
Using the naive independence assumption that
\[ P(x_i \mid y, x_1, \dots, x_{i-1}, x_{i+1}, \dots, x_n) = P(x_i \mid y) \]
for all \(i\), this relationship is simplified to
\[ P(y \mid x_1, \dots, x_n) = \frac{P(y) \prod_{i=1}^{n} P(x_i \mid y)}{P(x_1, \dots, x_n)} \]
GaussianNB implements the Gaussian Naive Bayes algorithm for classification. The likelihood of the features is assumed to be Gaussian:
\[ P(x_i \mid y) = \frac{1}{\sqrt{2\pi\sigma_y^2}} \exp\!\left(-\frac{(x_i - \mu_y)^2}{2\sigma_y^2}\right) \]
Advantages and Disadvantages of Naive Bayes [7]
Advantages:
  • Easy to implement
  • Requires small amount of data to estimate the parameters
  • Good results obtained in most of the cases
Disadvantages:
  • Relies on the assumption of class-conditional independence, which can cost accuracy
  • In practice, dependencies exist among variables
  • Such dependencies cannot be modeled by a Naive Bayes classifier
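To make the training step (5.1.2) concrete, here is a minimal sketch of fitting GaussianNB, assuming the PCA-reduced splits features_train_pca, features_test_pca, labels_train, and labels_test from Task 4:

    from sklearn.naive_bayes import GaussianNB
    from sklearn.metrics import accuracy_score

    # Fit the Gaussian likelihood model and score it on the held-out split.
    clf = GaussianNB()
    clf.fit(features_train_pca, labels_train)
    pred = clf.predict(features_test_pca)
    print("Naive Bayes accuracy:", accuracy_score(labels_test, pred))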
5.1.2 Training
5.1.3 Visualization
5.2 Support Vector Classification
5.2.1 Brief Introduction
A support vector machine[8] constructs a hyper-plane or set of hyper-planes in a high or infinite dimensional space, which can be used for classification, regression or other tasks. Intuitively, a good separation is achieved by the hyper-plane that has the largest distance to the nearest training data points of any class (so-called functional margin), since in general the larger the margin the lower the generalization error of the classifier.
Advantages and Disadvantages of SVM
Advantages:
  • Effective in high dimensional spaces
  • Still effective in cases where number of dimensions is greater than the number of samples
  • Uses a subset of training points in the decision function (called support vectors), so it is also memory efficient
  • Versatile: different Kernel functions can be specified for the decision function. Common kernels are provided, but it is also possible to specify custom kernels
Disadvantages:
  • If the number of features is much greater than the number of samples, the method is likely to give poor performances
  • SVMs do not directly provide probability estimates; these are calculated using an expensive five-fold cross-validation
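A hedged sketch of the SVM training step (5.2.2), again assuming the PCA-reduced splits from Task 4; the kernel and C value here are illustrative, not the tuned settings:

    from sklearn.svm import SVC

    # RBF kernel; C and gamma are placeholder values for illustration only.
    clf = SVC(kernel="rbf", C=1000.0, gamma="scale")
    clf.fit(features_train_pca, labels_train)
    print("SVM accuracy:", clf.score(features_test_pca, labels_test))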
5.2.2 Training
5.2.3 Visualization
5.3 Decision Tree
5.3.1 Brief Introduction[9]
Trees are a non-parametric supervised learning method used for classification and regression. The goal is to create a model that predicts the value of a target variable by learning simple decision rules inferred from the data features.
Advantages and Disadvantages of Decision Tree
Advantages:
  • Simple to understand and to interpret. Trees can be visualized
  • Requires little data preparation. Other techniques often require data normalization, the creation of dummy variables, and the removal of blank values
  • Able to handle both numerical and categorical data
  • Able to handle multi-output problems
Disadvantages:
  • Decision-tree learners can create over-complex trees that do not generalize the data well. This is called overfitting
  • Decision trees can be unstable because small variations in the data might result in a completely different tree being generated
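As a sketch of the decision tree training step (5.3.2), assuming the same PCA-reduced splits; min_samples_split is one common knob for limiting overfitting on a small dataset, and the value shown is only illustrative:

    from sklearn.tree import DecisionTreeClassifier

    # Require at least 10 samples to split a node, to keep the tree small.
    clf = DecisionTreeClassifier(min_samples_split=10, random_state=42)
    clf.fit(features_train_pca, labels_train)
    print("Decision tree accuracy:", clf.score(features_test_pca, labels_test))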
5.3.2 Training
5.3.3 Visualization
5.4 Adaboost Classifier
5.4.1 Brief Introduction
The core principle of AdaBoost[10] is to fit a sequence of weak learners (i.e., models that are only slightly better than random guessing, such as small decision trees) on repeatedly modified versions of the data. The predictions from all of them are then combined through a weighted majority vote (or sum) to produce the final prediction. The data modifications at each so-called boosting iteration consist of applying weights \(w_1\), \(w_2\), ..., \(w_N\) to each of the training samples.
(Image source: http://www.ieev.org/2010/03/adaboost-haar-features-face-detection_22.html)
Advantages and Disadvantages of AdaBoost Classifier [11]
Advantages:
  • Unlike other powerful classifiers, such as SVM, AdaBoost can achieve similar classification results with much less tweaking of parameters or settings
Disadvantages:
  • AdaBoost can be sensitive to noisy data and outliers
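A rough sketch of the AdaBoost training step (5.4.2), assuming the PCA-reduced splits; by default scikit-learn boosts decision stumps, and the estimator count and learning rate below are illustrative:

    from sklearn.ensemble import AdaBoostClassifier

    # 50 weak learners combined by weighted majority vote, as described above.
    clf = AdaBoostClassifier(n_estimators=50, learning_rate=1.0, random_state=42)
    clf.fit(features_train_pca, labels_train)
    print("AdaBoost accuracy:", clf.score(features_test_pca, labels_test))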
5.4.2 Training
5.4.3 Visualization
5.5 Random Forest Classifier
5.5.1 Brief Introduction
In Random Forest[12], each tree in the ensemble is built from a sample drawn with replacement (i.e., a bootstrap sample) from the training set. In addition, when splitting a node during the construction of the tree, the split that is chosen is no longer the best split among all features. Instead, the split that is picked is the best split among a random subset of the features. As a result of this randomness, the bias of the forest usually slightly increases (with respect to the bias of a single non-random tree) but, due to averaging, its variance also decreases, usually more than compensating for the increase in bias, hence yielding an overall better model.
(Video: https://www.youtube.com/watch?v=ajTc5y3OqSQ)
Advantages and Disadvantages of Random Forest Classifier [13]
Advantages:
  • It is one of the most accurate learning algorithms available. For many data sets, it produces a highly accurate classifier
  • It runs efficiently on large databases
  • It can handle thousands of input variables without variable deletion
Disadvantages:
  • Random forests have been observed to overfit for some datasets with noisy classification/regression tasks
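A sketch of the random forest training step (5.5.2), assuming the PCA-reduced splits; each tree sees a bootstrap sample and a random subset of features per split, exactly as described above, and the number of trees is illustrative:

    from sklearn.ensemble import RandomForestClassifier

    # An ensemble of 100 randomized trees whose votes are averaged.
    clf = RandomForestClassifier(n_estimators=100, random_state=42)
    clf.fit(features_train_pca, labels_train)
    print("Random forest accuracy:", clf.score(features_test_pca, labels_test))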
5.5.2 Training
5.5.3 Visualization
Task 6: Tune classifier to achieve better model
The code snippet for this section is available here
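As a rough illustration of the kind of tuning used (not the exact grid from the linked snippet), a grid search over hypothetical SVM parameters could look like this:

    from sklearn.model_selection import GridSearchCV
    from sklearn.svm import SVC

    # Hypothetical parameter grid; the actual values are in the linked snippet.
    param_grid = {"C": [1, 10, 100, 1000], "gamma": [0.001, 0.01, 0.1, 1.0]}
    grid = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5, scoring="f1")
    grid.fit(features_train_pca, labels_train)
    print(grid.best_params_, grid.best_score_)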
Here are the results we obtained.
Task 7: Model Evaluations
7.1 Classifier Accuracy Comparison
7.2 Confusion Matrices
7.3 Classification Report
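The three evaluations above can be produced with scikit-learn's metrics module; a minimal sketch, assuming clf is any of the fitted classifiers from Tasks 5-6 and the test split comes from Task 4 (the "non-POI"/"POI" label names are an assumption):

    from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

    pred = clf.predict(features_test_pca)
    print("Accuracy:", accuracy_score(labels_test, pred))      # 7.1
    print(confusion_matrix(labels_test, pred))                 # 7.2
    print(classification_report(labels_test, pred,
                                target_names=["non-POI", "POI"]))  # 7.3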

3. References

[1] https://classroom.udacity.com/courses/ud120/lessons/3335698626/concepts/33363086340923#
[2] https://classroom.udacity.com/courses/ud120/lessons/3335698626/concepts/33363086340923#
[3] https://github.com/udacity/ud120-projects/blob/master/tools/feature_format.py
[4] http://scikit-learn.org/stable/auto_examples/classification/plot_classifier_comparison.html
[5] http://qingkaikong.blogspot.com/2016/11/machine-learning-6-artificial-neural.html
[6] http://scikit-learn.org/stable/modules/naive_bayes.html
[7] https://www.slideshare.net/ashrafmath/naive-bayes-15644818
[8] http://scikit-learn.org/stable/modules/svm.html#svm-mathematical-formulation
[9] http://scikit-learn.org/stable/modules/tree.html#tree
[10] http://scikit-learn.org/stable/modules/ensemble.html#adaboost
[11] http://www.nickgillian.com/wiki/pmwiki.php/GRT/AdaBoost
[12] http://scikit-learn.org/stable/modules/ensemble.html#random-forests
[13] http://amateurdatascientist.blogspot.com/2012/01/random-forest-algorithm.html

4. Summary

This project is the final project of the Intro to Machine Learning course on Udacity. I tried a variety of classifiers; some of them, such as Decision Tree and AdaBoost, seemed to overfit and performed poorly on the test data. After some visualization, I realized that the data has a very noisy and messy distribution, which makes it difficult to model. I therefore went back to the preprocessing step and removed more data points that appeared to be outliers. This gave me better models, but they still seemed to overfit, so I tuned some model parameters to reduce the overfitting, and in the end all models reached an accuracy above 90%.

5. Future work

This is my first project since taking my first steps toward Machine Learning. I have learned about several exciting classifiers, but I have not yet gone deeper into their theory and mathematics. I will dive into that soon, because the theory and mathematics are beautiful, amazing, and definitely needed.
