Data Mining @ BCM

Homework Assignment 7: Image Analytics (deadline: noon, Fri, Feb 28)

This homework assignment is part of the accompanying material for the Baylor College of Medicine's Data Mining course. Please turn in your report through Slack, as described below, by the deadline communicated in the course's Slack channel.

The material is offered under the Creative Commons CC BY-NC-ND licence.

Image Analytics

Note that the image embedders included in Orange, such as Inception v3, work with very low-resolution images, typically around 200 × 200 pixels. Your collection may therefore contain very small images. Images in your collection can vary in size, but Orange will rescale them appropriately.

Gather a collection of 50 to 100 images (the more, the better), preferably related to biology, medicine, natural sciences, or, ideally, your own research or interests. If such images are unavailable, select any image set of your choice from the web or your photo album. Ensure the images are in JPG or PNG format, and keep the files small to avoid overloading the embedding server. Organize the image files into a folder, sorting them into subfolders to indicate classes. Load them using the Import Images widget and review them in the Image Viewer widget to ensure they have loaded correctly.
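
Before loading the folder in Orange, the layout described above can be checked with a short script. The following is only a sketch in plain Python, independent of Orange: the function name `summarize_collection` is hypothetical, and the 1 MB size threshold is an arbitrary assumption, not a limit stated by the course or by Orange.

```python
from collections import Counter
from pathlib import Path

# Assumptions for illustration only: the accepted extensions follow the
# assignment text (JPG or PNG); MAX_BYTES is an arbitrary, hypothetical
# threshold and not a documented limit of the embedding server.
ALLOWED_SUFFIXES = {".jpg", ".jpeg", ".png"}
MAX_BYTES = 1_000_000


def summarize_collection(parent_dir):
    """Count usable images per class subfolder and list problem files."""
    counts = Counter()
    problems = []
    for class_dir in sorted(Path(parent_dir).iterdir()):
        if not class_dir.is_dir():
            continue  # only subfolders define classes in this sketch
        for path in sorted(class_dir.iterdir()):
            if not path.is_file():
                continue
            if path.suffix.lower() not in ALLOWED_SUFFIXES:
                problems.append((path.name, "not JPG or PNG"))
            elif path.stat().st_size > MAX_BYTES:
                problems.append((path.name, "file larger than MAX_BYTES"))
            else:
                counts[class_dir.name] += 1
    return counts, problems
```

Calling `summarize_collection("my_images")`, where `my_images` stands for your own parent folder, returns the number of usable images per subfolder and a list of files that may need converting or shrinking.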

Now, apply the skills you have learned in this class to gain insights into your image set.

1 Cluster the images using either hierarchical clustering or k-means and evaluate the quality and meaningfulness of the clusters. Be sure to use cosine distance, as Euclidean distance does not perform well for this type of data, though you may try it for comparison.
2 Place the images in appropriate subfolders. When using the Import Images widget, load the parent directory; the image classes will correspond to the names of the subdirectories the images reside in. Can image classes be predicted from image embeddings, that is, from their vector-based descriptions obtained using the Image Embedding widget? Report the cross-validated accuracy; if your image set is small, use 5-fold cross-validation instead of the default 10-fold. Compare the scores (e.g., classification accuracy and AUC) for logistic regression, classification tree, and k-NN. Identify which method performs best and analyze why this might be the case.
3 Analyze the types of mistakes made by the learner with the best classification accuracy. Use the Confusion Matrix widget, which receives input from the Test and Score widget. Select a specific cell in the matrix corresponding to a type of misclassification, display the misclassified images from this group, and comment on the errors observed.
4 Project the images into a two-dimensional space using PCA, MDS, or t-SNE. Evaluate whether the projection makes sense. To illustrate, create an "image map" by plotting the projected images with points marked or labeled by class. Comment on the groups of images that emerge from this visualization. In your report, highlight one such group by selecting the corresponding points on the map and displaying their images in the Image Viewer.
5 Include anything else, that is, any other analysis that you think makes sense and sheds light on your image collection.
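
The clustering task above recommends cosine distance over Euclidean distance. The sketch below, in plain Python and independent of Orange, shows how the two can disagree. The three vectors are made-up toy values, not real image embeddings. Cosine distance depends only on the direction of the vectors, while Euclidean distance also reacts to their length. The assignment does not state whether this is the reason for the recommendation, so treat the example as an illustration of the difference only.

```python
import math


def euclidean_distance(u, v):
    """Straight-line distance between two vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def cosine_distance(u, v):
    """One minus the cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


# Made-up toy vectors standing in for embeddings (illustration only).
a = [1.0, 2.0, 0.0]
b = [10.0, 20.0, 0.0]  # same direction as a, ten times longer
c = [2.0, 1.0, 0.0]    # same length as a, different direction

for name, other in (("b", b), ("c", c)):
    print(
        f"a vs {name}: "
        f"euclidean={euclidean_distance(a, other):.3f}, "
        f"cosine={cosine_distance(a, other):.3f}"
    )
```

Under Euclidean distance, `a` is closer to `c` than to `b`. Under cosine distance the order is reversed, because `a` and `b` point in the same direction.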

The report should include figures showcasing sample images and all key results. Specifically, it should contain at least:

A dendrogram for hierarchical clustering results.
A confusion matrix highlighting classification performance and misclassifications.
An "image map" with a selection of data points and their corresponding images.
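
For reference, each cell of a confusion matrix counts the images of one actual class that were assigned one predicted class. The sketch below shows this in plain Python, independent of Orange. The function names and class labels are hypothetical; in the assignment itself, the Confusion Matrix widget computes these counts for you.

```python
from collections import Counter


def confusion_counts(actual, predicted):
    """Count how often each (actual, predicted) label pair occurs."""
    return Counter(zip(actual, predicted))


def accuracy(actual, predicted):
    """Fraction of items whose predicted label equals the actual label."""
    correct = sum(a == p for a, p in zip(actual, predicted))
    return correct / len(actual)


def misclassified_indices(actual, predicted, actual_class, predicted_class):
    """Positions of the items that fall into one cell of the matrix."""
    return [
        i
        for i, (a, p) in enumerate(zip(actual, predicted))
        if a == actual_class and p == predicted_class
    ]


# Made-up labels for six images (illustration only).
actual = ["cell", "cell", "cell", "tissue", "tissue", "organ"]
predicted = ["cell", "tissue", "cell", "tissue", "tissue", "tissue"]

for (a, p), n in sorted(confusion_counts(actual, predicted).items()):
    print(f"actual={a:<7} predicted={p:<7} count={n}")
print(f"accuracy={accuracy(actual, predicted):.2f}")
```

Selecting one off-diagonal cell in the widget corresponds to `misclassified_indices(actual, predicted, "cell", "tissue")` here, which returns the positions of the images to inspect.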

Additionally, include at least one of your selected workflows, showing only the essential widgets used for the tasks.

Besides the title, your name, and email, the report should include the following sections:

1 Introduction, which, in one short paragraph, describes the purpose of the report.
2 Data, which, in one short paragraph, describes the chosen problem (images), states where the images were obtained, and includes a figure showing a sample of images.
3 Methods, which briefly describes the methods used for data analysis in a few sentences and includes a figure of an example workflow (one that you select because, for example, it looks interesting), with an explanation in the figure caption.
4 Results, where you report the findings for each of the five tasks stated above (enumerate the subsections and give them appropriate titles).
5 Conclusion, where, in one paragraph, you summarize and comment on your findings and state whether the endeavor was successful or not, explaining why.

The report should be at most five pages long (this limit is strict!). Use an 11 pt sans-serif font (such as Calibri or Arial) with 1.2 line spacing and 6 pt spacing between paragraphs.

Prepare the report as a PDF file and send it to me via direct message on Slack. Do not send Word files.

Have fun! You're well on your way to becoming a proficient data scientist.