One of our missions here at Visual Meta is to categorize products from our partner shops as accurately as possible. With a product catalog consisting of more than 100 million items distributed across many different countries, describing the products manually is an insurmountable task. Instead we harness the power of machine learning algorithms and natural language processing (NLP) to automate the products’ categorization process.
In the machine learning group, we constantly seek ways to improve the classification process. The current system is based on deterministic methods to generate feature vectors. The feature vectors are quite simple, each dimension represents a word and its value is determined using frequency based embeddings, such as by measuring co-occurrence of words or simply their occurrence count. Although such feature vectors are easily interpretable, there are some potential shortcomings:
In a previous article , prediction based embeddings are learned on products’ textual attributes in an unsupervised way. In this article, we will extend this approach by learning product embeddings in a semi-supervised way using deep neural networks. First we learn word embeddings with Word2Vec on unlabeled domain specific data (fashion products). Then the embeddings are loaded into a vector space model and fine-tuned via supervised learning on labeled data. This enables the generation of feature vectors for the product catalog where similar products are mapped to nearby points. The result of this can be used for better recommendation of products and label classification. In Figure 1, we can see a proposed architecture of such model. Each word appearing in the product’s description is mapped to its corresponding embeddings, which is then used by the model to predict its true label. Once a network is fully trained, we can extract the output of an arbitrary hidden layer as a vector representation of the product’s description.
Figure 1: An example of a neural network vector space model using word embeddings.
To ensure maximum discoverability of products, we assign them labels/categories. Our categories have a hierarchical tree structure, going from abstract (e.g. shoes) to more detailed characterization of products (e.g. sneaker, sport shoes, etc.).
Using neural networks to make prediction on hierarchically structured output is not so obvious, so we approximated the task as a multi-label classification problem.
As dataset, we used all products under Fashion Accessories in Slovakia. We trained a Word2Vec model using all textual data and then fine-tuned it using only completely described products, that is products which are found in the leave nodes of the category tree. In total the dataset has
We use basic text cleaning such as punctuation removal, lowercasing and conversion of non-ASCII characters to closest ASCII equivalent before training.
After we trained the Word2Vec model, we did a brief sanity check to evaluate whether the learned embeddings are sensible.
For the classification part, we used bidirectional LSTM. They have shown very promising results in understanding context better on NLP tasks as they preserve information from both past and future .The labeled data samples becomes scarce deeper in the category tree. The following graph shows the label distribution across categories and subcategories.
The dataset is highly imbalanced, thus learning of the infrequent labels may become intractable without counter measures. To tackle this problem, we used Weighted Binary Cross-entropy loss function, where the weights are proportional to the label size. This is a way to tell the model to “pay more attention” to samples from under-represented classes.
As an evaluation metric we used Macro Average F1-Score. A macro-average treats all classes equally, by computing F1 score independently for each class and then taking the average.
To evaluate the model’s performance, we split the dataset into train/validation/test using stratified sampling, i.e. by preserving the label distribution for each set, as it eliminates the possibility of splits having no sample of a rare label. This has shown to improve upon standard validation in terms of bias and variance, as oppose to simple random sampling .
After training the proposed vector space model, the following results are acquired on convergence:
Furthermore, we can see in the following graph that the training procedure is stable and the model generalizes well. The performance is better on the validation data than then training data. That may simply be due to the fact that the validation data is “easier” to classify than the training data.
Finally, we generated a t-SNE visualization of all labeled data with each product colored based on their top level categories (around 100 thousand products in total and 26 categories). We did this by extracting the output of the last fully connected layer as seen in Figure 1. Note that items can have multiple labels, but are drawn with a single label on the graph.
The model generalizes well and produces meaningful embeddings of products. As seen on the t-SNE visualization, the products cluster quite nicely around their associated top level categories. Three categories have overlapping clusters (upper center and lower right), namely knit scarf, scarf / shawl and snood scarf / loop scarf. This indicates that the model has difficulties differentiating between these products. A potential solution to this problem would be to combine them under one category (e.g. scarf).
In this article we presented a deep neural network based model which we trained on large corpus of fashion products to assign products to categories. We tailored the model to optimize its performance on our hierarchical category tree by predicting the entire tree structure (attempting to completely describe every product). In spite of the inherent data issues (imbalanced and scarce labeled data), the model was able to learn good representation of products’ textual attributes and discover semantic and syntactic similarities among them. Additionally, the semi-supervised scheme proved to be appropriate for the task of classifying products. As future work we plan to try out input data with different modalities, e.g. combination of product’s price, image features and textual attributes.
Stay tuned for more updates on the project.