The task is to cluster the images in a provided dataset using selected algorithms. The 1000 images are assigned sequentially to 10 classes, and the goal is to reconstruct the class clusters as closely as possible. Because the assignment is sequential, the image numbers, taken in groups of 100, can be used to evaluate the correctness of the results.
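The grouping of image numbers into blocks of 100 can be turned into reference labels with integer division. The sketch below is only an illustration: the function name and the `first_number` parameter are hypothetical, and it assumes the images are numbered consecutively starting from 0 (set `first_number=1` if the numbering starts at 1).

```python
import numpy as np


def labels_from_image_numbers(image_numbers, group_size=100, first_number=0):
    """Map image numbers to class indices, one class per block of `group_size`.

    Assumption: numbers are consecutive and start at `first_number`.
    """
    numbers = np.asarray(list(image_numbers), dtype=int)
    return (numbers - first_number) // group_size


# Reference labels for 1000 images numbered 0..999 (assumed numbering).
true_labels = labels_from_image_numbers(range(1000))
print(np.bincount(true_labels))  # number of images per class
```

These reference labels are meant for evaluation only and should not be passed to the clustering algorithm itself.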
The clustering algorithm can be one of those covered in class: k-means, EM (Expectation Maximization), hierarchical clustering, DBSCAN, or OPTICS. Other algorithms available in the literature or in a machine learning library may also be used. The number of clusters can initially be fixed in advance to simplify the first approach, but experiments should also be conducted on automatic determination of the number of clusters.
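As a rough illustration of determining the number of clusters automatically, the sketch below runs scikit-learn's `KMeans` for several candidate values of k and keeps the one with the highest Silhouette Coefficient. This is only one of several possible selection strategies, not a prescribed method. The helper name `choose_k_by_silhouette` is hypothetical, and the synthetic blobs merely stand in for real image features.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score


def choose_k_by_silhouette(features, candidate_ks, random_state=0):
    """Return the candidate k with the highest silhouette score, plus all scores."""
    scores = {}
    for k in candidate_ks:
        model = KMeans(n_clusters=k, n_init=10, random_state=random_state)
        labels = model.fit_predict(features)
        scores[k] = silhouette_score(features, labels)
    best_k = max(scores, key=scores.get)
    return best_k, scores


# Synthetic stand-in for image features: four well-separated 2-D blobs.
features, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [10, 0], [0, 10], [10, 10]],
    cluster_std=0.5,
    random_state=0,
)
best_k, scores = choose_k_by_silhouette(features, range(2, 9))
print("selected k:", best_k)
```

For the simpler first approach with a predetermined number of clusters, the loop can be skipped and `KMeans(n_clusters=10)` fitted directly. Density-based algorithms such as DBSCAN and OPTICS do not take the number of clusters as a parameter at all.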
Perhaps the most important part of the project is the initial processing of the images, which are of different sizes. An autoencoder can be used as an initial feature extraction technique. It can be combined with image filters, such as greyscale conversion or an edge detection filter such as Sobel or Canny, to simplify the images and extract more useful features. One very useful approach to feature extraction from images is the Histogram of Oriented Gradients (HOG) method. Each of these approaches can be further augmented with dimensionality reduction techniques such as PCA.
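One possible preprocessing pipeline along these lines is sketched below: each image is converted to greyscale, resized to a common shape, described with HOG, and the resulting vectors are reduced with PCA. The target size, the HOG parameters, and the number of PCA components are illustrative assumptions rather than recommended values. The random arrays stand in for images loaded from disk, and the sketch assumes three-channel RGB or single-channel input.

```python
import numpy as np
from skimage.color import rgb2gray
from skimage.feature import hog
from skimage.transform import resize
from sklearn.decomposition import PCA


def hog_features(image, size=(64, 64)):
    """Greyscale + resize + HOG, so images of any size give equal-length vectors."""
    if image.ndim == 3:  # assumed RGB; greyscale input is passed through
        image = rgb2gray(image)
    image = resize(image, size, anti_aliasing=True)
    return hog(
        image,
        orientations=9,
        pixels_per_cell=(8, 8),
        cells_per_block=(2, 2),
    )


# Synthetic stand-ins for images of different sizes (height, width, channels).
rng = np.random.default_rng(0)
shapes = [(80, 120), (64, 64), (100, 90), (72, 150), (128, 96), (60, 60)]
images = [rng.random((h, w, 3)) for h, w in shapes]

features = np.stack([hog_features(img) for img in images])
reduced = PCA(n_components=3).fit_transform(features)
print(features.shape, "->", reduced.shape)
```

Resizing to a fixed shape is what makes the feature vectors comparable across images. With the full dataset, a larger number of PCA components would normally be tried.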
Students are encouraged to use any research papers, tutorials, or textbooks, provided they are properly identified and cited in the report. Any programming environment can be used. The final report should preferably compare the results of at least two different approaches, for example the first complete clustering experiment (the baseline) and the best approach.
A particular challenge in clustering is evaluating the performance of the implemented algorithm. Performance measures used for supervised classification models, such as Accuracy or Precision and Recall, cannot be applied directly, because the cluster indices produced by a clustering algorithm are arbitrary and need not coincide with the class labels. Instead, performance measures designed specifically for clustering must be employed. These come in two flavors. The first requires knowledge of the ground truth classes (as is the case in this project) and measures the consistency of the cluster assignment with the original classes. Examples of such measures include the Rand Index (RI) and the Adjusted Rand Index (ARI), Mutual Information (MI), Normalized Mutual Information (NMI), and Adjusted Mutual Information (AMI), as well as Homogeneity, Completeness, the V-measure, and the Fowlkes-Mallows Index. The second flavor does not require knowledge of the ground truth assignment and measures the similarity of samples within the assigned clusters according to some similarity metric. Examples of such measures are the Silhouette Coefficient, the Calinski-Harabasz Index, and the Davies-Bouldin Index.
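The two flavors of measures can be computed with `sklearn.metrics`, as in the sketch below. The toy labels at the top show a clustering that is identical to the ground truth up to a renaming of clusters, which Accuracy scores as 0 while ARI scores as 1. The synthetic blobs are an assumption standing in for real image features, and the selection of metrics is only a sample of those named above.

```python
from sklearn import metrics
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Same partition, different cluster names: Accuracy fails, ARI does not.
toy_true = [0, 0, 1, 1, 2, 2]
toy_pred = [2, 2, 0, 0, 1, 1]
toy_accuracy = metrics.accuracy_score(toy_true, toy_pred)
toy_ari = metrics.adjusted_rand_score(toy_true, toy_pred)
print("accuracy:", toy_accuracy, "ARI:", toy_ari)

# Synthetic stand-in for image features with known classes.
features, true_labels = make_blobs(
    n_samples=300,
    centers=[[0, 0], [10, 0], [0, 10]],
    cluster_std=0.5,
    random_state=0,
)
predicted = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(features)

# First flavor: requires the ground truth labels.
external = {
    "ARI": metrics.adjusted_rand_score(true_labels, predicted),
    "NMI": metrics.normalized_mutual_info_score(true_labels, predicted),
    "AMI": metrics.adjusted_mutual_info_score(true_labels, predicted),
    "V-measure": metrics.v_measure_score(true_labels, predicted),
    "Fowlkes-Mallows": metrics.fowlkes_mallows_score(true_labels, predicted),
}

# Second flavor: uses only the features and the predicted assignment.
internal = {
    "Silhouette": metrics.silhouette_score(features, predicted),
    "Calinski-Harabasz": metrics.calinski_harabasz_score(features, predicted),
    "Davies-Bouldin": metrics.davies_bouldin_score(features, predicted),
}

for name, value in {**external, **internal}.items():
    print(f"{name}: {value:.3f}")
```

Note that higher is better for all of these except the Davies-Bouldin Index, where lower values indicate better separation.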
In the simplest case, the Adjusted Rand Index or the Normalized Mutual Information can be calculated and used for comparison. Note that pairwise Precision and Recall are easy to calculate, and tempting for their intuitive connection with classification Precision and Recall, but cannot be easily used for comparison.
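Pairwise Precision and Recall treat every pair of samples as a binary decision: a pair is "positive" if both samples are placed in the same cluster. The sketch below counts pairs directly; the function name is hypothetical. The two degenerate clusterings at the end illustrate one plausible reading of the caveat above, offered here as an interpretation: a trivial solution can push one of the two numbers to an extreme, so neither number alone ranks clusterings sensibly.

```python
from itertools import combinations


def pairwise_precision_recall(true_labels, predicted_labels):
    """Precision and recall over all sample pairs (same cluster = positive)."""
    tp = fp = fn = 0
    for i, j in combinations(range(len(true_labels)), 2):
        same_true = true_labels[i] == true_labels[j]
        same_pred = predicted_labels[i] == predicted_labels[j]
        if same_pred and same_true:
            tp += 1
        elif same_pred:
            fp += 1
        elif same_true:
            fn += 1
    # Convention used here: an undefined ratio (zero denominator) is reported as 0.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


true_labels = [0, 0, 0, 1, 1, 1]
one_big_cluster = [0] * 6          # everything merged into one cluster
singletons = list(range(6))        # every sample in its own cluster

print(pairwise_precision_recall(true_labels, one_big_cluster))
print(pairwise_precision_recall(true_labels, singletons))
```

Merging everything into one cluster yields perfect recall with poor precision, while the all-singletons clustering makes no positive pairs at all. The explicit loop visits all pairs, which is roughly half a million for 1000 images and remains manageable.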
As before, the outcome of the project should be submitted in two parts: a report and a development package.
The report should include:
Report penalties:
The development package should:
Please write your scripts in a non-interactive way. The input files to
cluster should be assumed to be in the Cluster_img subdirectory of the
current location (or, optionally, in a directory specified as the command line
argument to the clustering program).
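A non-interactive entry point along these lines could be sketched as follows. The function names and the list of accepted file extensions are assumptions made for illustration; only the default `Cluster_img` directory and the optional command line argument come from the requirement above.

```python
import argparse
from pathlib import Path

# Assumed set of image file extensions; adjust to the actual dataset.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".bmp"}


def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="Cluster images in a directory.")
    parser.add_argument(
        "image_dir",
        nargs="?",                # the argument is optional
        default="Cluster_img",    # default location required by the assignment
        type=Path,
        help="directory with the images to cluster (default: Cluster_img)",
    )
    return parser.parse_args(argv)


def list_images(image_dir):
    """Return image paths sorted by name (lexicographic, not numeric, order)."""
    return sorted(
        p for p in image_dir.iterdir() if p.suffix.lower() in IMAGE_EXTENSIONS
    )


def main(argv=None):
    args = parse_args(argv)
    if not args.image_dir.is_dir():
        raise SystemExit(f"Image directory not found: {args.image_dir}")
    paths = list_images(args.image_dir)
    print(f"Found {len(paths)} images in {args.image_dir}")
    # Feature extraction, clustering, and evaluation would follow here.
    return 0
```

Ending the script with `if __name__ == "__main__": raise SystemExit(main())` lets it run without any prompts, either with no arguments or with a directory path as the single argument.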
Clustering with scikit-learn:
https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html
https://scikit-learn.org/stable/auto_examples/cluster/plot_agglomerative_clustering.html
Image processing with scikit-image:
https://scikit-image.org/
https://scikit-image.org/docs/dev/auto_examples/features_detection/plot_hog.html
Various tutorials on clustering:
https://towardsdatascience.com/image-clustering-using-k-means-4a78478d2b83
https://towardsdatascience.com/image-clustering-using-transfer-learning-df5862779571
Clustering performance evaluation:
https://scikit-learn.org/stable/modules/clustering.html#clustering-performance-evaluation
Miscellaneous:
https://docs.opencv.org/master/d9/df8/tutorial_root.html
https://www.learnopencv.com/histogram-of-oriented-gradients/
https://machinelearningmastery.com/use-pre-trained-vgg-model-classify-objects-photographs/
https://en.wikipedia.org/wiki/Otsu%27s_method
https://kapernikov.com/tutorial-image-classification-with-scikit-learn/
Karen Simonyan, Andrew Zisserman. Very Deep Convolutional Networks for Large-Scale Image Recognition. Visual Geometry Group, Department of Engineering Science, University of Oxford
Giuseppe Ciaburro, Prateek Joshi. Python Machine Learning Cookbook: Over 100 recipes to progress from smart data analytics to deep learning using real-world datasets, 2nd Edition, March 30, 2019