ShopALook – visual recommendation of fashion products using deep learning

By Sandip Mukherjee, 1 year ago

The ShopALook Visual Recommender

It’s easy to browse lots of great fashion images online, but not as easy to find the products in an online store and buy them. The ShopALook Visual Recommender makes this possible! Visual Meta has a huge catalogue of fashion items. Using deep learning, ShopALook detects the products shown in images from online fashion magazines or blogs. Similar-looking products are then found in the Visual Meta inventory and suggested to the user.

This is done in two steps:

Detection – detecting the fashion object using class activation mapping

Recommendation – recommending similar products using a multi-label CNN classifier

Fashion Object detection using class activation mapping

Detecting objects in real-life images is hard. State-of-the-art detection algorithms need extensive training data, i.e. images annotated with bounding boxes for the object we are trying to detect.

We use class activation mapping to detect the object with a simple CNN classifier. A CNN trained on object categorization is able to localize the discriminative regions for an object; for example, if the object is a shoe, the network can indicate where in the image the shoe is located. The steps are as follows:

We use a network architecture similar to VGG16, with 5 convolutional layers and max pooling.

Instead of fully connected layers, we use global average pooling (GAP) after the last convolutional layer, which connects directly to the softmax layer. We train this CNN on classes of different fashion objects such as shoes, dresses and bags.
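To make the GAP step concrete, here is a minimal sketch of the classification head described above, written in plain Python so that it is self-contained. The feature maps, the weights and the two class names are toy values chosen for illustration, not the network's real parameters.

```python
import math

def global_average_pool(feature_maps):
    """Reduce each 2-D feature map to its mean value."""
    pooled = []
    for fmap in feature_maps:
        values = [v for row in fmap for v in row]
        pooled.append(sum(values) / len(values))
    return pooled

def softmax(scores):
    """Numerically stable softmax over a list of class scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def classify(feature_maps, class_weights):
    """class_weights[c][k] links pooled feature map k to class c."""
    pooled = global_average_pool(feature_maps)
    scores = [sum(w * p for w, p in zip(weights, pooled))
              for weights in class_weights]
    return softmax(scores)

# Two toy 2x2 feature maps and two classes (hypothetical values).
feature_maps = [[[1.0, 3.0], [1.0, 3.0]],
                [[0.0, 0.0], [0.0, 4.0]]]
class_weights = [[1.0, 0.0],   # class 0, e.g. "shoe"
                 [0.0, 1.0]]   # class 1, e.g. "bag"
probs = classify(feature_maps, class_weights)
```

Because the softmax layer sees only one number per feature map, each output weight says how much one feature map contributes to one class. Class activation mapping relies on exactly this property.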



In the classification phase, we slide a window of a given height and width across the image with a given step. For each window, we classify whether it contains a specific class (such as shoe) and take the activations from the last convolutional layer that are responsible for that classification. The important regions are given by a linear combination of the convolutional feature maps, weighted by the output-layer weights for that class.
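The linear combination mentioned above can be sketched as follows. This is an illustrative example in plain Python, assuming the feature maps of the last convolutional layer and the output weights for one class are already available; the numbers are made up.

```python
def class_activation_map(feature_maps, weights_for_class):
    """Weighted sum of the feature maps: cam[y][x] = sum_k w_k * f_k[y][x]."""
    height = len(feature_maps[0])
    width = len(feature_maps[0][0])
    cam = [[0.0] * width for _ in range(height)]
    for weight, fmap in zip(weights_for_class, feature_maps):
        for y in range(height):
            for x in range(width):
                cam[y][x] += weight * fmap[y][x]
    return cam

# Two toy 2x2 feature maps and the output weights of one class (hypothetical).
feature_maps = [[[1.0, 0.0], [0.0, 0.0]],
                [[0.0, 0.0], [0.0, 2.0]]]
shoe_weights = [0.5, -1.0]
cam = class_activation_map(feature_maps, shoe_weights)
```

In this toy case the top-left cell gets a positive activation (evidence for the class) and the bottom-right cell a negative one (evidence against it).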


We then add up all positive activations for the class across the image to produce a heatmap, and threshold it. Below is an example heatmap for a shoe. From the thresholded heatmap, we use contour detection to get the bounding box that encloses the object.
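A rough sketch of this step is shown below. It assumes each window's activation map has already been resized to image resolution, and it replaces contour detection with a simpler stand-in: the smallest rectangle around all cells above the threshold. The function names and values are hypothetical.

```python
def accumulate_heatmap(image_h, image_w, window_cams):
    """Sum the positive activations of every window into one heatmap.

    window_cams is a list of (top, left, cam) tuples, where cam is a 2-D
    list placed with its upper-left corner at (top, left) in the image.
    """
    heatmap = [[0.0] * image_w for _ in range(image_h)]
    for top, left, cam in window_cams:
        for dy, row in enumerate(cam):
            for dx, value in enumerate(row):
                if value > 0:  # negative activations are ignored
                    heatmap[top + dy][left + dx] += value
    return heatmap

def bounding_box(heatmap, threshold):
    """Return (top, left, bottom, right) around cells above threshold, or None."""
    hits = [(y, x)
            for y, row in enumerate(heatmap)
            for x, value in enumerate(row)
            if value > threshold]
    if not hits:
        return None
    ys = [y for y, _ in hits]
    xs = [x for _, x in hits]
    return (min(ys), min(xs), max(ys), max(xs))

# Two overlapping 2x2 windows on a 4x4 image (toy values).
windows = [(1, 1, [[1.0, -1.0], [0.5, 2.0]]),
           (2, 2, [[3.0, 0.0], [0.0, -4.0]])]
heatmap = accumulate_heatmap(4, 4, windows)
box = bounding_box(heatmap, threshold=0.75)
```

Unlike contour detection, this stand-in returns a single box, so it cannot separate two objects of the same class in one image.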



Fashion item recommendation using multi-label CNN classifier



We use a multi-label CNN classifier for image recommendation. A CNN works well not only for classification; it also learns good feature representations, which can be used to compare images and find similar ones in a large set.

The challenge in our case is the domain difference between the query image and the product images in our inventory. Compared with product images, query images from blogs have real-life backgrounds and are taken from different angles.

For example, the above images show the same product model. Image B should be recommended when image A comes in as a query. To make this comparison possible, the deep CNN should learn domain-invariant features. We address this by training on images from both domains for the same tag sets. The recommendation process works as follows:

We use our tags as labels for images from our catalog. One image has multiple labels, i.e. there are multiple ones in the multi-hot encoded label vector. For example, one image can have the labels [shoe, sneaker, low-sneaker, nike, black].
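As an illustration, the label vector for the example image could be built like this. The vocabulary below is a made-up seven-tag list; the real tag vocabulary is much larger.

```python
def multi_hot(tags, vocabulary):
    """Return a 0/1 vector with a one at the position of every given tag."""
    index = {tag: i for i, tag in enumerate(vocabulary)}
    vector = [0] * len(vocabulary)
    for tag in tags:
        vector[index[tag]] = 1
    return vector

# Hypothetical tag vocabulary in a fixed order.
vocabulary = ["bag", "black", "dress", "low-sneaker", "nike", "shoe", "sneaker"]
label = multi_hot(["shoe", "sneaker", "low-sneaker", "nike", "black"], vocabulary)
# label is [0, 1, 0, 1, 1, 1, 1]
```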


We collect images from blogs for the same tag sets as in our catalog to achieve domain invariance. We use data augmentation (flipping, rotation, translation, Gaussian noise injection) to increase the amount of data for tag sets that don’t have enough images.
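The four augmentations can be sketched on a tiny grayscale image stored as a list of rows. This is only an illustration: the rotation is fixed at 90 degrees, the translation shifts to the right only, and the noise level is an arbitrary choice rather than a setting from our pipeline.

```python
import random

def flip_horizontal(image):
    """Mirror the image left to right."""
    return [list(reversed(row)) for row in image]

def rotate_90(image):
    """Rotate the image 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]

def translate_right(image, shift, fill=0.0):
    """Shift pixels right by `shift` columns, padding the left with `fill`."""
    width = len(image[0])
    shift = min(shift, width)
    return [[fill] * shift + row[:width - shift] for row in image]

def add_gaussian_noise(image, sigma, rng):
    """Add zero-mean Gaussian noise with standard deviation sigma."""
    return [[value + rng.gauss(0.0, sigma) for value in row] for row in image]

def augment(image, rng):
    """Return one augmented copy per transformation."""
    return [flip_horizontal(image),
            rotate_90(image),
            translate_right(image, 1),
            add_gaussian_noise(image, 0.05, rng)]

image = [[1.0, 2.0],
         [3.0, 4.0]]
copies = augment(image, random.Random(0))  # seeded for repeatability
```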


When an object is detected in a query image, we feed the cropped object to our multi-label classifier to get the feature representation of the image. We also classify the image to get the tag set in which to look for similar images. We then compare it against all catalog images for that tag set using Euclidean distance and return the closest matches.
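A minimal sketch of this lookup is shown below. It assumes the feature vectors have already been extracted and that "for that tag set" means an exact tag-set match. The catalog entries, ids and two-dimensional features are invented for the example.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def recommend(query_features, query_tagset, catalog, top_k=3):
    """Return ids of the closest catalog items that share the query's tag set."""
    candidates = [item for item in catalog if item["tagset"] == query_tagset]
    ranked = sorted(candidates,
                    key=lambda item: euclidean(query_features, item["features"]))
    return [item["id"] for item in ranked[:top_k]]

# Hypothetical catalog with tiny 2-D feature vectors.
sneakers = frozenset({"shoe", "sneaker"})
catalog = [
    {"id": "sneaker-1", "tagset": sneakers, "features": [0.9, 0.1]},
    {"id": "sneaker-2", "tagset": sneakers, "features": [0.2, 0.8]},
    {"id": "bag-1", "tagset": frozenset({"bag"}), "features": [1.0, 0.0]},
]
matches = recommend([1.0, 0.0], sneakers, catalog)
```

Here bag-1 is the nearest item in feature space, but it is never returned because its tag set differs from the query's. Filtering by tag set first also keeps the number of distance computations small.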




After training our CNN classifier we use the last fully connected layer to extract features for all images in our catalog.

We train the VGG16 network with a sigmoid cross-entropy loss for multi-label classification. To take advantage of pre-trained features, we initialize our network with a model pre-trained on the ImageNet dataset. We then fine-tune the convolutional layers and train the fully connected layers from scratch to fit our data. The network structure is as follows:
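For reference, the sigmoid cross-entropy loss treats every tag as an independent yes/no decision. The sketch below computes it for a single image in plain Python, using a numerically stable form. Averaging over the tags is an assumption here, since frameworks differ in how they normalize this loss.

```python
import math

def sigmoid_cross_entropy(logits, targets):
    """Mean binary cross-entropy over all tags for one image.

    Per tag this equals -t*log(s) - (1-t)*log(1-s) with s = sigmoid(z),
    rewritten as max(z, 0) - z*t + log(1 + exp(-|z|)) to avoid overflow.
    """
    total = 0.0
    for z, t in zip(logits, targets):
        total += max(z, 0.0) - z * t + math.log1p(math.exp(-abs(z)))
    return total / len(logits)

targets = [1, 0, 1]  # multi-hot label vector (toy example)
good = sigmoid_cross_entropy([4.0, -4.0, 4.0], targets)   # confident and right
bad = sigmoid_cross_entropy([-4.0, 4.0, -4.0], targets)   # confident and wrong
```

Unlike softmax, the per-tag sigmoids do not compete with each other, so several tags can be predicted for the same image.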

To test our recommendations, we build a test set with manually annotated recommendations. We then measure recommendation recall on this test set; for example, if 6 of the recommended items are present in the ground-truth set, the recall is 0.6 (6 out of 10).
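The measurement can be sketched as below, using the standard definition of recall (hits divided by the size of the ground-truth set). The item ids and the ground-truth size of 10 are assumptions picked to reproduce the 0.6 example.

```python
def recall(recommended, ground_truth):
    """Fraction of ground-truth items that appear among the recommendations."""
    truth = set(ground_truth)
    if not truth:
        raise ValueError("ground truth must not be empty")
    hits = len(truth & set(recommended))
    return hits / len(truth)

# Hypothetical test case: 10 annotated items, 6 of them were recommended.
ground_truth = ["item-%d" % i for i in range(10)]
recommended = ["item-0", "item-1", "item-2", "item-3", "item-4", "item-5",
               "other-1", "other-2"]
score = recall(recommended, ground_truth)  # 6 hits out of 10
```

Note that this score ignores the order of the recommendations, so a good recall can still hide a poor ranking.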

We observe that even though the recall is quite good for most of the test cases, the ranking of the recommendations is not correct in many cases. In the future, we plan to improve the ranking by using metric learning to learn a custom metric for our dataset instead of using Euclidean distance.

Stay tuned for more updates on the project.
