Content Based Image Retrieval without Metadata*

10 min read · May 22, 2020

“Simple can be harder than complex: You have to work hard to get your thinking clean to make it simple. But it’s worth it in the end because once you get there, you can move mountains.”
― Steve Jobs

Introduction

Content-based image retrieval (CBIR) refers to retrieving similar images from a dataset by providing an image as the query. It is akin to giving the computer human vision. Consider a scenario: while browsing through some posters, you are struck by the beauty of a dress on one of them, but you have no time to enquire about it.

Instead of the traditional approach of typing a description of the dress into various websites and marketplaces, a click-and-search mechanism only requires the customer to give the search portal an image of the product; the results are returned in order of resemblance. The idea has already been implemented by many search engines, such as Google Images.

Our work aims at retrieving relevant images without requiring any additional metadata. The ground truth of the available data is unknown. Computer vision algorithms, deep learning, similarity techniques, and unsupervised learning are combined to make this task possible.

Computer vision techniques and deep learning algorithms are used together for feature extraction. Each image is represented as a feature vector. Similarity measures such as cosine similarity and Euclidean distance are used to measure the closeness between the feature vector of the query image and those of the images in the dataset. In this project, we take an image from the user and display the most similar images as results.

Dataset Description

CBIR-50: It consists of 10,000 images grouped into 50 categories such as Mobiles, India Gate, Kangaroo, and Jeans. Each category has 200 images of varying sizes.

The dataset can be downloaded from here.

Shoe Dataset: It consists of 50,025 catalog images collected from Zappos.com. The images are divided into 4 major categories: shoes, sandals, slippers, and boots. The shoes are centered on a white background and pictured in the same orientation.

The dataset can be downloaded from here.

Holiday Dataset: The dataset includes a very large variety of scene types (natural, man-made, water and fire effects, etc.), and the images are in high resolution. It contains 1,491 images in total, organized into groups that each represent a distinct scene or object.

The dataset can be downloaded from here.

Baseline

The basic assumption is that an image can be of two types: an image containing a single (atomic) object, or an image containing multiple objects. For an atomic-object image, retrieval can be performed directly after object detection. To detect the object, we need to build a mask that removes the background.

Here is the algorithm to create the mask:

Invert the image and apply edge detection.
Apply a Gaussian blur to eliminate noise and smooth outliers.
Apply thresholding and adaptive thresholding to remove lone pixels.
Apply the mask to the original image.
Extract features of the image.
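
The masking steps above can be sketched roughly as follows. This is a minimal NumPy-only illustration, not the project's actual code: a mean filter stands in for the Gaussian blur, a plain gradient magnitude stands in for the edge detector, and filling the bounding box of the detected edges is an assumption added here so that the mask covers the whole object. The names `box_blur`, `build_mask`, and `apply_mask` are hypothetical.

```python
import numpy as np

def box_blur(img, k=3):
    """Mean filter, used here as a simple stand-in for a Gaussian blur."""
    pad = k // 2
    padded = np.pad(img, pad, mode="edge")
    h, w = img.shape
    out = np.zeros((h, w), dtype=float)
    for dy in range(k):
        for dx in range(k):
            out += padded[dy:dy + h, dx:dx + w]
    return out / (k * k)

def build_mask(gray, rel_threshold=0.1):
    """Return a boolean foreground mask for a 2-D grayscale image."""
    inverted = 255.0 - gray.astype(float)        # invert the image
    gy, gx = np.gradient(inverted)               # gradient-based edge detection
    edges = np.hypot(gx, gy)
    smooth = box_blur(edges)                     # blur to suppress noise
    if smooth.max() == 0:                        # flat image: nothing to mask
        return np.zeros(gray.shape, dtype=bool)
    edge_mask = smooth > rel_threshold * smooth.max()   # threshold weak responses
    # Assumption of this sketch: treat the bounding box of the edges as the object.
    ys, xs = np.nonzero(edge_mask)
    mask = np.zeros(gray.shape, dtype=bool)
    mask[ys.min():ys.max() + 1, xs.min():xs.max() + 1] = True
    return mask

def apply_mask(gray, mask):
    """Keep foreground pixels and set the background to zero."""
    return np.where(mask, gray, 0)
```

In practice a library such as OpenCV would usually replace the hand-written blur and edge steps; the sketch only shows how the steps chain together.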

For an image containing multiple objects, image segmentation (partitioning) is required, and features are extracted from each segment. After feature extraction, the image search is performed by similarity matching between the query image and the database images.

Architecture

Preprocessing

The Holiday dataset and CBIR-50 images had different dimensions and aspect ratios. To train models on these images, they were scaled to the minimum dimensions present in the dataset (240 px × 240 px). Since the images were not square, they were cropped from the center. Images in the shoe dataset were clean and did not require much preprocessing; unlike the Holiday dataset, all of its images had the same dimensions.
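
A rough sketch of this preprocessing step is shown below. It assumes the image is already loaded as a NumPy array, crops first and then resizes (the exact order used in the project is not stated), and uses nearest-neighbour sampling to stay dependency-free. The function names are hypothetical.

```python
import numpy as np

def center_crop_square(img):
    """Crop the largest centered square from an H x W (x C) array."""
    h, w = img.shape[:2]
    side = min(h, w)
    top = (h - side) // 2
    left = (w - side) // 2
    return img[top:top + side, left:left + side]

def resize_nearest(img, size=240):
    """Resize to size x size by nearest-neighbour sampling."""
    h, w = img.shape[:2]
    rows = np.arange(size) * h // size   # source row for each output row
    cols = np.arange(size) * w // size   # source column for each output column
    return img[rows[:, None], cols]

def preprocess(img, size=240):
    """Center-crop to a square, then scale to size x size."""
    return resize_nearest(center_crop_square(img), size)
```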

Feature Extraction

There are two types of features that help describe images: global descriptors and local descriptors. Global descriptors describe the image as a whole and are commonly used in image classification. It is important to extract global features because images can be non-atomic, i.e., there can be more than one object in an image. Consider the image of a child playing with a dog: it could be categorized as either a child or a dog. Therefore, for image retrieval it is necessary to extract global features.

We have used the following global descriptors:

Histogram matching is an important feature for finding similar images, as it measures similarity in color distribution. Similar images tend to have a similar color distribution.
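
As a minimal sketch, a color histogram feature could be computed like this. The bin count, the per-channel layout, and the use of histogram intersection as the comparison are assumptions made for illustration.

```python
import numpy as np

def color_histogram(img, bins=8):
    """Concatenated per-channel histogram of an H x W x 3 uint8 image, summing to 1."""
    hists = [np.histogram(img[..., c], bins=bins, range=(0, 256))[0] for c in range(3)]
    hist = np.concatenate(hists).astype(float)
    return hist / hist.sum()

def histogram_intersection(h1, h2):
    """Similarity in [0, 1]; 1 means identical normalized histograms."""
    return float(np.minimum(h1, h2).sum())
```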

The Segmented Histogram feature was used to overcome a drawback of the simple histogram, which discards spatial information: important features are often localized in one segment of an image rather than spread over the whole image.
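
One way to sketch a segmented histogram is to split the image into a grid and concatenate the histogram of each cell. The grid layout and bin count below are illustrative assumptions, and `segmented_histogram` is a hypothetical name.

```python
import numpy as np

def segmented_histogram(img, grid=(2, 2), bins=8):
    """Concatenate normalized color histograms of each grid cell of an H x W x 3 image."""
    h, w = img.shape[:2]
    rows, cols = grid
    feats = []
    for r in range(rows):
        for c in range(cols):
            cell = img[r * h // rows:(r + 1) * h // rows,
                       c * w // cols:(c + 1) * w // cols]
            hist = np.concatenate([
                np.histogram(cell[..., ch], bins=bins, range=(0, 256))[0]
                for ch in range(3)
            ]).astype(float)
            feats.append(hist / max(hist.sum(), 1.0))   # guard against empty cells
    return np.concatenate(feats)
```

With a 1 x 1 grid this reduces to the plain histogram, so an image and its mirror image look identical; with a finer grid the two can be told apart.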

The Hu Moments feature is a global feature extraction technique used to describe the shape of an important object in an image.
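
To illustrate the idea, the sketch below computes only the first two of the seven Hu invariants from normalized central moments. It assumes a non-empty grayscale or binary image; the function name is hypothetical, and a full implementation would normally come from an image-processing library.

```python
import numpy as np

def hu_first_two(gray):
    """First two Hu moment invariants of a 2-D image with non-zero total intensity."""
    img = gray.astype(float)
    y, x = np.indices(img.shape)
    m00 = img.sum()
    xc = (x * img).sum() / m00          # centroid column
    yc = (y * img).sum() / m00          # centroid row

    def eta(p, q):
        # Normalized central moment: central moment scaled by m00^(1 + (p+q)/2).
        mu = ((x - xc) ** p * (y - yc) ** q * img).sum()
        return mu / m00 ** (1 + (p + q) / 2)

    h1 = eta(2, 0) + eta(0, 2)
    h2 = (eta(2, 0) - eta(0, 2)) ** 2 + 4 * eta(1, 1) ** 2
    return h1, h2
```

The values stay the same when the shape is shifted or rotated by 90 degrees, which is what makes them useful as shape descriptors.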

The HOG (Histogram of Oriented Gradients) feature describes edges and shape using histograms of gradient orientations. It captures sharp changes in pixel intensity and records the magnitude and direction of the gradients.

The Haralick feature quantifies an image based on texture. The fundamental concept in computing Haralick texture features is the Gray Level Co-occurrence Matrix (GLCM).
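
As a small illustration, the sketch below builds a GLCM for one pixel offset and derives the contrast statistic from it, which is one of the Haralick features. The quantization to 8 gray levels and the single horizontal offset are assumptions; the function names are hypothetical.

```python
import numpy as np

def glcm(gray, levels=8, dx=1, dy=0):
    """Symmetric, normalized co-occurrence matrix for the offset (dx, dy), dx, dy >= 0."""
    q = gray.astype(int) * levels // 256          # quantize 0..255 into `levels` bins
    h, w = q.shape
    src = q[0:h - dy, 0:w - dx]                   # reference pixels
    dst = q[dy:h, dx:w]                           # their neighbours at the offset
    m = np.zeros((levels, levels), dtype=float)
    np.add.at(m, (src.ravel(), dst.ravel()), 1)   # count co-occurring level pairs
    m = m + m.T                                   # count each pair in both directions
    total = m.sum()
    return m / total if total else m

def contrast(p):
    """Haralick contrast: sum over (i, j) of (i - j)^2 * p[i, j]."""
    i, j = np.indices(p.shape)
    return float(((i - j) ** 2 * p).sum())
```

A flat image gives a contrast of zero, while a fine checkerboard of black and white pixels gives the maximum value for the chosen number of levels.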

Deep learning is a powerful tool for feature engineering. Its advantage is that, instead of hand-designing the convolution filters that extract image features such as vertical edges, horizontal edges, and noise distribution, the model learns the filters and weights itself during training. For this problem, we used transfer learning for feature extraction.

Apart from the global features, each image has also been described with features extracted using VGG16 and ResNet50. Both models are pre-trained on the ImageNet dataset and performed well on our task. The full ImageNet has about 22,000 categories; the commonly distributed pre-trained weights for these models are trained on its 1,000-class ILSVRC subset.

These features are helpful in retrieving relevant images for a query. However, to use them optimally, they should be used in combination rather than in isolation.
The following method can be used to obtain any combination of the features mentioned above in order to generate the most relevant search results.

Cluster Pruning

Since the search space is extremely large, an exhaustive query takes a long time, so retrieval should be optimized. Time is a real constraint in search: a user would rather receive a few false results quickly than wait for more accurate ones. To reduce the search time, we need to reduce the search space, and clustering is a good way to do so.

We first compare the query image with all the cluster centers. The search is then carried out only in the cluster whose center is most similar to the query, and the remaining clusters are pruned. We used k-means clustering and spectral clustering for this task.
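
Cluster pruning can be sketched as follows with a small hand-written k-means. This is an illustrative assumption, not the project's code: it uses Euclidean distance, a fixed number of iterations, and hypothetical function names.

```python
import numpy as np

def assign(X, centers):
    """Index of the nearest center for every row of X."""
    d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    return d.argmin(axis=1)

def kmeans(X, k, iters=20, seed=0):
    """Plain Lloyd's k-means; returns (centers, labels)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(iters):
        labels = assign(X, centers)
        for j in range(k):
            members = X[labels == j]
            if len(members):                  # keep the old center if a cluster empties
                centers[j] = members.mean(axis=0)
    return centers, assign(X, centers)

def pruned_search(query, X, centers, labels, top_k=5):
    """Search only inside the cluster whose center is closest to the query."""
    nearest = assign(query[None, :], centers)[0]
    candidates = np.flatnonzero(labels == nearest)     # all other clusters are pruned
    d = np.linalg.norm(X[candidates] - query, axis=1)
    return candidates[np.argsort(d)[:top_k]]
```

The clustering is done once offline; at query time only one comparison per cluster center is needed, followed by a search over a single cluster instead of the whole dataset.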

Feature Matching

kNN or k-nearest neighbour

It is a simple, proximity-based algorithm that makes the naive assumption that similar data points (images) lie close to each other in the feature space. kNN becomes more effective when such points are also comparatively far from non-relevant data points. If these assumptions hold for a dataset, kNN becomes a powerful method for retrieving the most relevant search results for a query.

Two major drawbacks are associated with real-life use of kNN. First, data points are rarely arranged in neat pockets of similar points, which makes kNN ineffective at times. Second, as the number of data points under consideration increases, kNN becomes slower.
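
A brute-force kNN lookup over feature vectors can be sketched like this, with the two distances used in our experiments (Euclidean and Manhattan). The function name and signature are hypothetical.

```python
import numpy as np

def knn_search(query, features, k=5, metric="euclidean"):
    """Return (indices, distances) of the k rows of `features` closest to `query`."""
    diff = features - query
    if metric == "euclidean":
        dist = np.sqrt((diff ** 2).sum(axis=1))
    elif metric == "manhattan":
        dist = np.abs(diff).sum(axis=1)
    else:
        raise ValueError(f"unknown metric: {metric}")
    order = np.argsort(dist)[:k]
    return order, dist[order]
```

Every query compares against all stored vectors, which is why the search slows down as the dataset grows.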

Cosine Similarity

Another approach for determining the similarity between a pair of images is cosine similarity. It measures how closely aligned two data points are in the feature space, i.e., the cosine of the angle between their feature vectors. Similar images tend to have feature vectors that point in nearly the same direction, which makes it a powerful measure for relevant image retrieval.

The cosine measure is generally more robust than the distance-based kNN described above. However, it handles high variance along a particular feature (or a small set of features) poorly, which is a major pitfall: similar points may appear far apart even when the variance is very high along only a single feature.

Normalization may seem like a viable option here, but it can again lead to poor matching, because points that have a similar direction yet differ substantially in magnitude may then appear closely associated. Because of this dilemma, the cosine measure should not be treated as a generic measure for relevant image (information) retrieval.
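
For two feature vectors a and b, cosine similarity is a·b / (‖a‖ ‖b‖). A minimal sketch, with hypothetical function names:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

def rank_by_cosine(query, features, top_k=5):
    """Return (indices, similarities) of the top_k rows most aligned with `query`."""
    norms = np.linalg.norm(features, axis=1) * np.linalg.norm(query)
    sims = features @ query / np.where(norms == 0, 1.0, norms)
    order = np.argsort(-sims)[:top_k]
    return order, sims[order]
```

Note that `[1, 2]` and `[10, 20]` get a similarity of 1 even though their magnitudes differ by a factor of ten, which is the magnitude-blindness discussed above.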

Voting Similarity

Almost all matching techniques have some drawbacks. We can minimize the negative effect of each similarity measure by using several of them together and generating a consolidated result. Therefore, combining several such retrieval models may prove to be a more effective approach.

The candidate models may differ in the feature set they use as well as in the similarity measure they employ to find the closest matches for a query image. Not all of these models are equally robust, so it can be useful to weight them by their usefulness for the domain or query at hand.

The following is one particular example of a voting classifier.
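
One possible way to combine several ranked result lists is a weighted, rank-based vote, sketched below. The scoring rule (a weighted Borda-style count), the model names, and the weights are assumptions made for illustration, not the exact scheme used in the project.

```python
from collections import defaultdict

def weighted_vote(rankings, weights, top_k=5):
    """Merge ranked lists from several models into one list.

    rankings: dict mapping model name -> list of image ids, best match first.
    weights:  dict mapping model name -> weight of that model.
    """
    scores = defaultdict(float)
    for name, ranked_ids in rankings.items():
        n = len(ranked_ids)
        for rank, image_id in enumerate(ranked_ids):
            # Higher-ranked images earn more points, scaled by the model's weight.
            scores[image_id] += weights[name] * (n - rank) / n
    # Sort by score (descending); break ties by id for a stable result.
    return sorted(scores, key=lambda i: (-scores[i], i))[:top_k]
```

An image that appears near the top of several lists, or at the top of a heavily weighted one, ends up ahead of images that only a single weak model retrieved.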

Results and Observations

Feature matching using Cosine Similarity

Feature matching using KNN with Manhattan Distance.

Feature matching using KNN with Euclidean Distance

Feature matching using Voting Similarity

It has been observed experimentally that the VGG feature set is highly effective at retrieving relevant results, and that combining it with the Segmented Histogram filters out more of the irrelevant images.
Precision Recall curve using KNN on top 100 documents.

PR curve for 7 queries top 100 documents

Precision Recall curve using Cosine Similarity on top 100 documents

Precision Recall curve using Voting Similarity on top 100 documents
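
For reference, the points of a precision-recall curve such as the ones above can be computed from a ranked result list as sketched here. The function name is hypothetical, and the set of relevant images for a query is assumed to be known for evaluation purposes.

```python
def precision_recall_points(retrieved, relevant):
    """Return [(precision@k, recall@k)] for k = 1..len(retrieved)."""
    relevant = set(relevant)
    if not relevant:
        raise ValueError("at least one relevant item is required")
    hits = 0
    points = []
    for k, item in enumerate(retrieved, start=1):
        if item in relevant:
            hits += 1
        points.append((hits / k, hits / len(relevant)))
    return points
```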

In addition to ensuring that relevant images are retrieved, the Voting Similarity measure ranks the images retrieved by multiple models using the models' weights and the ranks assigned by the respective models.
It is observed that k-means works better for atomic image retrieval, while spectral clustering outperforms k-means on images containing multiple objects.

a) Spectral cumulative precision for 100 queries b) K-means cumulative precision for 100 queries

Feature Matching results on Holiday Dataset and Shoe Dataset.

Acknowledgement
This article is based on a project for the Information Retrieval course at our college. Special thanks to Professor Dr. Tanmoy Chakraborty and the Teaching Assistants for suggesting the project and guiding us in achieving these results.
I also want to thank my colleagues Mukul, Bhakti, and Shubham for contributing equally to this project.
Contribution
Bhakti Batra (MT19115): Preprocessing, feature extraction, and model evaluation.
Anmol Jain (MT19005): Preprocessing and feature extraction, transfer learning, model testing, and analysis.
Shubham Verma (MT19017): Baseline of normal retrieval (atomic object in the image) and segmented retrieval (multiple objects), and the cluster pruning technique.
Mukul Sharma (MT19042): Retrieval models implementation (kNN, cosine similarity, and voting similarity).

Written by Anmol