Data Mining @ BCM

Homework Assignment 5: Classifiers and their Decision Boundaries (deadline: noon, Fri, Feb 21)

This homework assignment is part of the accompanying material for the Baylor College of Medicine's Data Mining course. Please turn in your one-page report through Slack, as described below, by the deadline communicated in the course's Slack channel.

The material is offered under the Creative Commons CC BY-NC-ND licence.

Assignment

In this homework assignment, you will study and experiment with different classifiers and reason about their decision boundaries.

Consider the following classifiers:

classification tree with a depth of 1, a so-called "stump"; set the tree learner's parameter "Limit the maximal tree depth to" to 1 (stump)
classification tree with a depth of 3 (tree)
logistic regression without regularization; set the regularization type to None (logreg)
random forest with ten trees (forest)
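
To make the notion of a stump concrete, here is a minimal sketch in plain Python of what a depth-1 classification tree amounts to: a single threshold on a single feature. The function names `fit_stump` and `predict_stump` are hypothetical, and the sketch picks the split by counting misclassifications, which is an assumption made for brevity; it is not the splitting criterion of any particular tree learner.

```python
def fit_stump(X, y):
    """Find the best single axis-aligned split for binary labels 0/1.

    Returns (feature, threshold, left_label, right_label, errors),
    or None if no feature has two distinct values to split between.
    """
    best = None
    for f in range(len(X[0])):
        values = sorted({row[f] for row in X})
        # Candidate thresholds: midpoints between consecutive distinct values.
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            for left_label in (0, 1):
                right_label = 1 - left_label
                # Count training items the split would misclassify.
                errors = sum(
                    (left_label if row[f] <= t else right_label) != label
                    for row, label in zip(X, y)
                )
                if best is None or errors < best[4]:
                    best = (f, t, left_label, right_label, errors)
    return best


def predict_stump(stump, row):
    """Classify one data item with a stump returned by fit_stump."""
    f, t, left_label, right_label, _ = stump
    return left_label if row[f] <= t else right_label
```

Because the decision boundary of such a model is one horizontal or vertical line in a two-dimensional scatter plot, it is worth thinking about which class layouts a single line of that kind can and cannot separate.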

Paint the following data sets:

A) a data set where logreg succeeds and stump fails,

B) a data set where tree succeeds and stump fails,

C) a data set where tree succeeds and logreg fails,

D) a data set where forest succeeds and logreg fails,

E) a data set where forest succeeds and tree fails.

Each data set should include at least 200 data items. Paint the data sets with exactly two classes. Do not paint the data set with three, four, or more classes.

Judge the success of each method by cross-validated AUC: an AUC above 0.95 denotes success, and an AUC below 0.8 denotes failure.
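
As a reminder of what the AUC thresholds above measure, the following plain-Python sketch computes AUC as the probability that a randomly chosen item of class 1 receives a higher score than a randomly chosen item of class 0, with ties counted as one half. The helper names `auc` and `judge` are hypothetical, and the "inconclusive" label for values between 0.8 and 0.95 is an assumption, since the assignment defines only success and failure. The sketch does not perform cross-validation; it assumes the scores were already predicted for items that the model did not see during training.

```python
def auc(scores, labels):
    """Area under the ROC curve for binary labels 0/1.

    Computed as the fraction of (positive, negative) pairs in which the
    positive item has the higher score; ties contribute 0.5.
    """
    pos = [s for s, label in zip(scores, labels) if label == 1]
    neg = [s for s, label in zip(scores, labels) if label == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one item of each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def judge(auc_value):
    """Apply the assignment's thresholds to an AUC value."""
    if auc_value > 0.95:
        return "success"
    if auc_value < 0.8:
        return "failure"
    return "inconclusive"  # assumed label; not defined in the assignment
```

For example, `judge(auc([0.9, 0.2, 0.8, 0.1], [1, 1, 0, 0]))` evaluates three of the four positive-negative pairs as correctly ordered, giving an AUC of 0.75 and therefore a failure.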

Submit the homework as a one-page (!) report in PDF. The report should include:

The title of the homework, your name, and your email.
A scatter plot of each of the five painted data sets, showing the true class and the classifications by the method that achieves the higher AUC on that particular data set (e.g., for data set A, the plot should show the classifications of logreg). Use the color and shape of the data points accordingly. If the color denotes the classification by the machine learning model, you may switch on Show color regions to highlight the decision boundary.
In the caption of each figure, report the computed cross-validated AUC for both methods that you are comparing.
A figure with the workflow you used for any one of the five problems. Make sure the workflow includes only the widgets necessary to answer the question.
A short closing paragraph with your comments on the results or any discussion.

The report should not exceed one page! This page limit is strict. Make sure, though, that the figures are rendered well and are legible. Submit your report as a PDF document via direct message to me on Slack.