Clustering image files

The task is to cluster the images of a provided dataset using selected algorithms. The 1000 images belong to 10 classes and are numbered consecutively, class by class, in groups of 100; the goal is to reconstruct these classes as closely as possible. Because each consecutive group of 100 image numbers corresponds to one class, the image numbers can be used to compute the correctness of the results.
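As a minimal sketch of how the image numbers could be turned into reference labels, the helper below maps a file name to a class index. The file-naming pattern (a number somewhere in the file name) and the number of the first image are assumptions, so first_index is left as a parameter; true_class is a hypothetical name, not part of any library.

```python
import re
from pathlib import Path


def true_class(filename, first_index=0, group_size=100):
    """Return the class index implied by the image number in the file name.

    Assumes images are numbered consecutively and that every block of
    `group_size` consecutive numbers forms one class. `first_index` is the
    number of the first image in the dataset (0 or 1, depending on the data).
    """
    match = re.search(r"\d+", Path(filename).stem)
    if match is None:
        raise ValueError(f"no image number found in {filename!r}")
    return (int(match.group()) - first_index) // group_size
```

If the dataset starts numbering at 1, calling true_class(name, first_index=1) keeps image 100 in the first class.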

The clustering algorithm can be any of those covered in class (k-means, EM (Expectation Maximization), hierarchical clustering, DBSCAN) or another algorithm found in the literature or available in a machine learning library. The number of clusters can initially be fixed in advance to simplify the first approach, but experiments with automatic determination of the number of clusters should also be performed.
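The two modes described above (a fixed number of clusters and an automatically determined one) could look roughly as follows with scikit-learn. This is only a sketch: choosing the number of clusters by the silhouette score is one common option and not a required method, and the function names and the candidate range are assumptions made for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score


def cluster_fixed_k(features, n_clusters=10, seed=0):
    """Run k-means with a predetermined number of clusters."""
    features = np.asarray(features, dtype=float)
    model = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    return model.fit_predict(features)


def choose_k_by_silhouette(features, candidates=range(2, 16), seed=0):
    """Try several cluster counts and keep the one with the best silhouette.

    Returns the selected k and a dict mapping every candidate k to its score.
    """
    features = np.asarray(features, dtype=float)
    scores = {}
    for k in candidates:
        labels = cluster_fixed_k(features, n_clusters=k, seed=seed)
        scores[k] = silhouette_score(features, labels)
    best_k = max(scores, key=scores.get)
    return best_k, scores
```

Other algorithms from the list, such as DBSCAN, determine the number of clusters themselves and would replace the search loop rather than fit into it.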

The most important part of the project is the initial processing of the images, which are of different sizes. An autoencoder can be used as an initial feature extraction technique. It can be combined with image filters, such as greyscale conversion or an edge detection filter (for example Sobel or Canny), to simplify the images in the hope of extracting more useful features. One very useful approach to feature extraction from images is the Histogram of Oriented Gradients (HOG) method. Any of these approaches can additionally be combined with dimensionality reduction techniques, such as PCA.
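One possible preprocessing chain for images of different sizes is sketched below with scikit-image and scikit-learn: greyscale conversion, resizing to a common shape, HOG, then PCA. The target size of 64x64, the HOG parameters and the function names are assumptions chosen for illustration, not recommended settings.

```python
import numpy as np
from skimage.color import rgb2gray
from skimage.feature import hog
from skimage.transform import resize
from sklearn.decomposition import PCA


def hog_features(image, size=(64, 64)):
    """Convert one image to greyscale, resize it and return its HOG vector."""
    image = np.asarray(image, dtype=float)
    if image.ndim == 3:
        # Drop a possible alpha channel, then convert RGB to greyscale.
        image = rgb2gray(image[..., :3])
    # Resizing to a common shape gives every image the same feature length.
    image = resize(image, size, anti_aliasing=True)
    return hog(image, orientations=9, pixels_per_cell=(8, 8),
               cells_per_block=(2, 2))


def extract_features(images, n_components=50):
    """Stack the HOG vectors of all images and reduce them with PCA."""
    X = np.stack([hog_features(img) for img in images])
    # PCA cannot return more components than samples or features.
    n_components = min(n_components, *X.shape)
    return PCA(n_components=n_components).fit_transform(X)
```

The images can be loaded as arrays with, for example, skimage.io.imread, and the resulting matrix passed directly to a clustering algorithm.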

Students are encouraged to take advantage of any scientific papers, tutorials, or textbooks, provided they are properly acknowledged and cited in the report. Any programming environment can be used. The final report should preferably compare at least two different approaches, for example the first successful clustering experiment and the best approach. Please evaluate the results against the original 10 classes, reporting at least precision and recall.
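Cluster labels are arbitrary, so they have to be matched to the original classes before precision and recall can be computed. The sketch below uses a simple majority vote (each cluster is assigned the class most common among its members); this matching rule is an assumption, and other schemes, such as an optimal one-to-one assignment, are equally valid. The function name is hypothetical.

```python
from collections import Counter, defaultdict


def precision_recall(true_labels, cluster_labels):
    """Return {class: (precision, recall)} after majority-vote matching."""
    # Collect the true classes of the members of every cluster.
    members = defaultdict(list)
    for true, cluster in zip(true_labels, cluster_labels):
        members[cluster].append(true)
    # Assign each cluster the most frequent true class among its members.
    cluster_to_class = {
        cluster: Counter(classes).most_common(1)[0][0]
        for cluster, classes in members.items()
    }
    predicted = [cluster_to_class[c] for c in cluster_labels]

    scores = {}
    for cls in sorted(set(true_labels)):
        pairs = list(zip(true_labels, predicted))
        tp = sum(1 for t, p in pairs if t == cls and p == cls)
        fp = sum(1 for t, p in pairs if t != cls and p == cls)
        fn = sum(1 for t, p in pairs if t == cls and p != cls)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        scores[cls] = (precision, recall)
    return scores
```

With majority voting several clusters may map to the same class and some classes may receive no cluster at all; such classes get a precision and recall of 0.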

Deliverables

As before, the outcome of the project should be submitted in two parts: a report and a development package.

The report should include:

NO title page, table of contents, list of figures, etc.; only a compact header: project title, class name, author, date
preliminary analysis of the problem and data, if any
outline of the clustering experiments performed: method, software used, data processing, distance measure, convergence condition, an approach to the determination of the number of classes, results
summary: best approach and results

Report penalties:

report unnecessarily long
results not precisely and completely stated for each experiment
raw results from the program or screenshots pasted instead of a summary
too much data with no clearly given summary
excessive precision with no attempt at proper rounding
no clear outline of the results

The development package should:

be possible to run on Linux,
contain a Readme.txt file describing how to run it, including required software packages, their versions and how to install them,
be able to process and cluster a dataset structured identically to the original dataset, and output the results in the same form as shown in the report,
reproduce, for the original dataset, the same results as shown in the report; it is sufficient to reproduce the best set of results from the report,
NOT contain the original dataset.

Please write your scripts in a non-interactive way. The input files to cluster should be assumed to be in the Cluster_img subdirectory of the current directory (or, optionally, in a directory specified as a command-line argument to the clustering program).
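A non-interactive entry point following this convention might be sketched as below, using only the Python standard library. The set of accepted file extensions and the function names are assumptions; the actual clustering step is left out.

```python
import argparse
from pathlib import Path

# Assumed set of image file extensions; adjust to the actual dataset.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".bmp"}


def parse_args(argv=None):
    """Read the optional input directory; default to ./Cluster_img."""
    parser = argparse.ArgumentParser(
        description="Cluster the images in a directory.")
    parser.add_argument("image_dir", nargs="?", default="Cluster_img",
                        type=Path,
                        help="directory with the images (default: Cluster_img)")
    return parser.parse_args(argv)


def list_images(image_dir):
    """Return the image files of the directory in sorted order."""
    return sorted(p for p in Path(image_dir).iterdir()
                  if p.suffix.lower() in IMAGE_SUFFIXES)


def main(argv=None):
    """Hypothetical entry point: list the inputs without asking anything."""
    args = parse_args(argv)
    for path in list_images(args.image_dir):
        print(path.name)
```

In a real script, main() would be called from an if __name__ == "__main__": guard and would go on to extract features, cluster them and print the summary shown in the report.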

Useful literature

https://scikit-image.org/

https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html

https://scikit-learn.org/stable/auto_examples/cluster/plot_agglomerative_clustering.html

https://scikit-image.org/docs/dev/auto_examples/features_detection/plot_hog.html

https://docs.opencv.org/master/d9/df8/tutorial_root.html

https://www.learnopencv.com/histogram-of-oriented-gradients/

https://www.analyticsvidhya.com/blog/2019/08/3-techniques-extract-features-from-image-data-machine-learning-python/

https://www.analyticsvidhya.com/blog/2019/09/feature-engineering-images-introduction-hog-feature-descriptor/

https://towardsdatascience.com/image-clustering-using-k-means-4a78478d2b83

https://towardsdatascience.com/image-clustering-using-transfer-learning-df5862779571

https://towardsdatascience.com/introduction-to-image-segmentation-with-k-means-clustering-83fd0a9e2fc3

https://franky07724-57962.medium.com/using-keras-pre-trained-models-for-feature-extraction-in-image-clustering-a142c6cdf5b1

http://www.adeveloperdiary.com/data-science/computer-vision/how-to-implement-sobel-edge-detection-using-python-from-scratch/

https://machinelearningmastery.com/use-pre-trained-vgg-model-classify-objects-photographs/

https://en.wikipedia.org/wiki/Otsu%27s_method

https://kapernikov.com/tutorial-image-classification-with-scikit-learn/

Karen Simonyan, Andrew Zisserman. Very Deep Convolutional Networks for Large-Scale Image Recognition. Visual Geometry Group, Department of Engineering Science, University of Oxford

Giuseppe Ciaburro, Prateek Joshi. Python Machine Learning Cookbook: Over 100 recipes to progress from smart data analytics to deep learning using real-world datasets. 2nd Edition, March 30, 2019