Clustering image files
The task is to cluster the images in the provided dataset using selected
algorithms. The 1000 images are arranged consecutively in 10 classes of 100
images each, and the goal is to reconstruct these classes as closely as
possible. Since each consecutive group of 100 image numbers forms one class,
the image numbers can be used to compute the correctness of the results.
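The mapping from image numbers to reference classes can be sketched as
follows. This is only an illustration: the file naming (a number embedded in
the file name), the helper names image_number and true_class, and the question
of whether numbering starts at 0 or 1 are assumptions that must be checked
against the actual dataset.

```python
import re
from pathlib import Path


def image_number(path):
    """Return the last run of digits in the file name (assumed naming scheme)."""
    matches = re.findall(r"\d+", Path(path).stem)
    if not matches:
        raise ValueError(f"No image number found in {path!r}")
    return int(matches[-1])


def true_class(number, group_size=100, first_number=0):
    """Map an image number to its class index, one class per block of group_size.

    first_number is the number of the first image (0 or 1); this is an
    assumption to verify against the dataset.
    """
    return (number - first_number) // group_size


# Hypothetical file name, used only to show the two helpers together.
example_class = true_class(image_number("Cluster_img/img_0421.jpg"))
```

With numbering from 0, images 0 to 99 fall into class 0 and images 900 to 999
into class 9; pass first_number=1 if the numbering starts at 1.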
The clustering algorithm can be one of those covered in class (k-means,
EM (Expectation Maximization), hierarchical clustering, DBSCAN) or another
algorithm found in the literature or available in a machine learning library.
The number of clusters can be fixed in advance to simplify the first
approach, but experiments with automatic determination of the number of
clusters should also be performed.
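One possible way to determine the number of clusters automatically is to run
k-means for a range of k and keep the k with the highest mean silhouette
score. The sketch below assumes scikit-learn is available; the function name
choose_k_by_silhouette and the synthetic demo data are illustrative only, and
the silhouette criterion is just one of several options (others include the
elbow method or density-based algorithms such as DBSCAN, which do not need k).

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score


def choose_k_by_silhouette(features, k_values):
    """Return (best_k, scores); scores maps each k to its mean silhouette."""
    scores = {}
    for k in k_values:
        model = KMeans(n_clusters=k, n_init=10, random_state=0)
        labels = model.fit_predict(features)
        scores[k] = silhouette_score(features, labels)
    best_k = max(scores, key=scores.get)
    return best_k, scores


# Synthetic stand-in for image features (an assumption for the demo):
# three tight, well-separated groups of 50 points each.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
demo_features = np.vstack(
    [center + 0.1 * rng.standard_normal((50, 2)) for center in centers]
)
best_k, scores = choose_k_by_silhouette(demo_features, range(2, 7))
```

On real image features the silhouette curve is usually much flatter than on
this toy data, so the scores for all tested k are worth reporting, not only
the selected one.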
The most important part of the project is the preprocessing of the
images, which are of different sizes. An autoencoder can be used as an initial
feature extraction technique. It can be combined with image filters, such as
greyscale conversion or an edge detection filter (for example Sobel or Canny),
to simplify the images in the hope of extracting more useful features. One very
useful approach to feature extraction from images is the Histogram of Oriented
Gradients (HOG) method. Any of these approaches can additionally be augmented
by dimensionality reduction techniques, such as PCA.
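As an illustration of such a pipeline, the sketch below converts images to
greyscale, resizes them to a common size, computes HOG descriptors, and reduces
them with PCA. It assumes scikit-image and scikit-learn; the function name
hog_features, the 128x128 target size, the HOG parameters, and the number of
PCA components are arbitrary choices for the example, not recommended
settings. Random arrays stand in for images loaded from disk.

```python
import numpy as np
from skimage.color import rgb2gray
from skimage.feature import hog
from skimage.transform import resize
from sklearn.decomposition import PCA


def hog_features(images, size=(128, 128)):
    """Return one HOG descriptor per image after greyscale conversion and resizing."""
    rows = []
    for image in images:
        if image.ndim == 3:
            # Drop a possible alpha channel, then convert RGB to greyscale.
            image = rgb2gray(image[..., :3])
        # A common size gives every image a descriptor of the same length.
        image = resize(image, size, anti_aliasing=True)
        rows.append(
            hog(image, orientations=9, pixels_per_cell=(8, 8), cells_per_block=(2, 2))
        )
    return np.array(rows)


# Random images of different sizes, standing in for the real dataset.
rng = np.random.default_rng(0)
shapes = [(90, 120), (200, 150), (64, 64), (128, 300), (75, 75), (240, 180)]
demo_images = [rng.integers(0, 256, size=(h, w, 3), dtype=np.uint8) for h, w in shapes]

features = hog_features(demo_images)  # shape: (number of images, 8100)
reduced = PCA(n_components=3, random_state=0).fit_transform(features)
```

Resizing to a fixed size distorts the aspect ratio of non-square images;
cropping or padding before resizing is an alternative worth comparing.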
Students are encouraged to take advantage of scientific papers, tutorials,
and textbooks, provided they are properly acknowledged and cited in the report.
Any programming environment can be used. In the final report it would be good
to compare at least two different approaches, for example the first successful
clustering experiment and the best approach. Please evaluate the results
against the original 10 classes, reporting at least precision and recall.
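Cluster labels are arbitrary, so precision and recall cannot be read off
directly as in classification. One common interpretation, sketched below, is
pair counting: a pair of images counts as a true positive when the two images
share both a cluster and an original class. This is an assumption about how to
apply the measures here; mapping each cluster to its majority class and
computing per-class precision and recall is an equally valid alternative. The
function name is illustrative.

```python
from collections import Counter
from math import comb


def pairwise_precision_recall(true_labels, cluster_labels):
    """Pair-counting precision and recall of a clustering against reference classes."""
    if len(true_labels) != len(cluster_labels):
        raise ValueError("Label sequences must have the same length")

    def pairs(counts):
        # Number of unordered pairs inside each group, summed over groups.
        return sum(comb(n, 2) for n in counts.values())

    same_both = pairs(Counter(zip(true_labels, cluster_labels)))
    same_cluster = pairs(Counter(cluster_labels))
    same_class = pairs(Counter(true_labels))

    precision = same_both / same_cluster if same_cluster else 0.0
    recall = same_both / same_class if same_class else 0.0
    return precision, recall


# Small worked example: two classes of three items, two clusters of sizes 2 and 4.
precision, recall = pairwise_precision_recall([0, 0, 0, 1, 1, 1], [5, 5, 7, 7, 7, 7])
```

In this example 4 pairs agree in both partitions, 7 pairs share a cluster, and
6 pairs share a class, giving a precision of 4/7 and a recall of 4/6.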
Deliverables
As before, the outcome of the project should be submitted in two parts: a
report and a development package.
The report should include:
- NO title page, table of contents, list of figures, etc.; only a compact
header: project title, class name, author, date
- preliminary analysis of the problem and data, if any
- outline of the clustering experiments performed: method, software used,
data processing, distance measure, convergence condition, an approach to the
determination of the number of classes, results
- summary: best approach and results
Report penalties:
- report unnecessarily long
- results not precisely and completely stated for each experiment
- raw results from the program or screenshots pasted instead of a summary
- too much data with no clearly given summary
- too much precision with no attempt at proper rounding
- no clear outline of the results
The development package should:
- be possible to run on Linux,
- contain a Readme.txt file describing how to run it, including
required software packages, their versions and how to install them,
- allow processing and clustering of a dataset structured identically
to the original dataset, and output the results in the same way as shown in
the report,
- reproduce, for the original dataset, the same results as shown in the
report; it is sufficient to reproduce the best set of results from the report,
- NOT contain the original dataset.
Please write your scripts so that they run non-interactively. The input files
to cluster should be assumed to be in the Cluster_img subdirectory of the
current location (or, optionally, in a directory specified as a command-line
argument to the clustering program).
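A minimal sketch of a non-interactive entry point following this convention is
shown below, using only the Python standard library. The accepted file
suffixes and the function names are assumptions made for the example; the
feature extraction and clustering steps are left as a placeholder.

```python
import argparse
from pathlib import Path

# Assumed image formats; adjust to the formats present in the dataset.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".bmp"}


def parse_args(argv=None):
    """Parse the optional image directory; defaults to ./Cluster_img."""
    parser = argparse.ArgumentParser(description="Cluster the images in a directory.")
    parser.add_argument(
        "image_dir",
        nargs="?",
        type=Path,
        default=Path("Cluster_img"),
        help="directory with the images to cluster (default: Cluster_img)",
    )
    return parser.parse_args(argv)


def list_images(image_dir):
    """Return the image files in image_dir in a stable, sorted order."""
    if not image_dir.is_dir():
        raise SystemExit(f"Image directory not found: {image_dir}")
    return sorted(p for p in image_dir.iterdir() if p.suffix.lower() in IMAGE_SUFFIXES)


def main(argv=None):
    args = parse_args(argv)
    paths = list_images(args.image_dir)
    print(f"Found {len(paths)} images in {args.image_dir}")
    # Placeholder: feature extraction, clustering, and result output go here.
    return 0
```

In a script, main() would be called under the usual __name__ == "__main__"
guard, so that the program runs without prompts either with no arguments or
with a single directory argument.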
Useful literature
https://scikit-image.org/
https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html
https://scikit-learn.org/stable/auto_examples/cluster/plot_agglomerative_clustering.html
https://scikit-image.org/docs/dev/auto_examples/features_detection/plot_hog.html
https://docs.opencv.org/master/d9/df8/tutorial_root.html
https://www.learnopencv.com/histogram-of-oriented-gradients/
https://opencv-python-tutroals.readthedocs.io/en/latest/index.html
https://www.analyticsvidhya.com/blog/2019/08/3-techniques-extract-features-from-image-data-machine-learning-python/
https://www.analyticsvidhya.com/blog/2019/09/feature-engineering-images-introduction-hog-feature-descriptor/
https://towardsdatascience.com/image-clustering-using-k-means-4a78478d2b83
https://towardsdatascience.com/image-clustering-using-transfer-learning-df5862779571
https://towardsdatascience.com/introduction-to-image-segmentation-with-k-means-clustering-83fd0a9e2fc3
https://franky07724-57962.medium.com/using-keras-pre-trained-models-for-feature-extraction-in-image-clustering-a142c6cdf5b1
http://www.adeveloperdiary.com/data-science/computer-vision/how-to-implement-sobel-edge-detection-using-python-from-scratch/
https://machinelearningmastery.com/use-pre-trained-vgg-model-classify-objects-photographs/
https://en.wikipedia.org/wiki/Otsu%27s_method
https://kapernikov.com/tutorial-image-classification-with-scikit-learn/