Text processing has always attracted a lot of research and product attention, and over the past years many problems have used this source of data to generate knowledge and reduce human effort at multiple levels of need: language translation, artificial text generation, news clustering and, most relevant for Visual Meta, classifying the products in our catalog.
There are many techniques designed to address these needs; the most common one is known as “Bag of Words” (BoW). This is a simple approach where you use a defined dictionary of words (features), count how often each word appears, optionally apply weights to some features, and obtain as output a representation (feature vector) that is ready to be used as input to any machine-learning-based solution (classification, clustering, recommendation, etc.).
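As a hedged illustration of what BoW looks like in Spark MLlib (the toy sentences and column names below are assumptions for the example, not our real data), the vectorizer learns a dictionary from the corpus and turns each text into a sparse frequency vector:

```python
# Minimal Bag-of-Words sketch with Spark MLlib; the sentences and
# column names are illustrative only.
from pyspark.sql import SparkSession
from pyspark.ml.feature import Tokenizer, CountVectorizer

spark = SparkSession.builder.appName("bow-example").getOrCreate()

df = spark.createDataFrame(
    [("red sneaker fits well with blue jeans",),
     ("blue jeans and white t-shirt",)],
    ["text"],
)

# Split raw text into word tokens, then count token frequencies
# against a dictionary learned from the corpus itself.
tokens = Tokenizer(inputCol="text", outputCol="words").transform(df)
cv_model = CountVectorizer(inputCol="words", outputCol="features").fit(tokens)

# Each row now carries a sparse frequency vector usable by any ML model.
cv_model.transform(tokens).select("words", "features").show(truncate=False)
```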
Visual Meta has the task of classifying more than 150 million items into approximately 20,000 product categories for more than 6,000 shops.
Given the size of this problem, you can imagine that we mainly need to deal with the following issues: a) shops have different ways of providing product information; b) we need to maintain context between categories so that items are not misclassified. For the latter, imagine that a shop is promoting a “red sneaker” and in the item description they mention that “the red sneaker fits well with nice blue jeans and a white t-shirt”. Now you can see that just counting words is not enough and that we need to build a dictionary that represents each of the categories we have.
So far we have only talked about product classification, but we also need to advertise our product catalog through different marketing channels, and these often use keywords to trigger campaign ads. This use case can turn into a clustering problem, where we need to identify the keyword combinations (or clusters) that best represent a category to be advertised by our marketing managers.
And we still have the same issue: we again need humans to build such context dictionaries!
From that, we had the following questions to be answered:
The first question we answered in other experiments, described in another tech-corner post. The second one we answered during our last Hackathon, and the solution is described below!
The first thing we needed for the experiments was the datasets, collected from our own product portfolio:
| Country | Category | Number of items |
| --- | --- | --- |
For the technology, we chose Apache Spark, which provides a good machine learning library (MLlib) as well as a convenient SQL-like interface for working with data in memory.
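As a small sketch of that workflow (the file path and column names here are hypothetical, not our actual feed format), loading a dataset into Spark and exploring it with the SQL interface looks roughly like this:

```python
# Hedged sketch: load a product feed into Spark and query it via SQL.
# The file name and the "category" column are assumptions for the example.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("keyword-clustering").getOrCreate()

# Load the exported items into an in-memory DataFrame.
items = spark.read.json("sports_germany_items.json")
items.createOrReplaceTempView("items")

# The SQL-like interface makes quick sanity checks and aggregations easy.
spark.sql("""
    SELECT category, COUNT(*) AS n_items
    FROM items
    GROUP BY category
    ORDER BY n_items DESC
""").show()
```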
The steps are quite common in any data-driven solution, and the pipeline we used was no different; a rough sketch follows below.
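The sketch below shows one plausible shape for such a pipeline with Spark MLlib: tokenize the item texts, learn word embeddings with Word2Vec, and cluster the resulting vectors into keyword groups. The parameters, column names and input file are assumptions for illustration, not our production settings.

```python
# Hedged tokenize -> Word2Vec -> KMeans pipeline sketch in Spark MLlib;
# vector size, cluster count and column names are illustrative assumptions.
from pyspark.sql import SparkSession
from pyspark.ml.feature import RegexTokenizer, Word2Vec
from pyspark.ml.clustering import KMeans

spark = SparkSession.builder.appName("keyword-pipeline").getOrCreate()
items = spark.read.json("sports_germany_items.json")  # hypothetical input

# 1. Tokenize raw item descriptions into lower-cased word lists.
tokenizer = RegexTokenizer(inputCol="description", outputCol="words",
                           pattern="\\W+", toLowercase=True)
tokens = tokenizer.transform(items)

# 2. Learn dense word embeddings; each item gets the average of its word vectors.
w2v = Word2Vec(vectorSize=100, minCount=5,
               inputCol="words", outputCol="vector")
w2v_model = w2v.fit(tokens)
vectors = w2v_model.transform(tokens)

# 3. Cluster the embedded items into groups that share a context.
kmeans = KMeans(featuresCol="vector", predictionCol="cluster", k=20, seed=42)
clusters = kmeans.fit(vectors).transform(vectors)

clusters.groupBy("cluster").count().show()
```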
| Dataset | Keywords generated | Context | Keywords (extract) |
| --- | --- | --- | --- |
| Sports – Germany | ~50,000 | “soccer” | “fußballzubehör vereinsmannschaften funktionsshirt”, “kappa dortmund”, “oberteile sportausrüstung”, “trikot stutzen fussball-fan-shop” |
| | | “winter sport” | “skihose lange”, “skibindungen”, “snow boots snowboardschuhe”, “snowboardartikel arbor jones”, “unterteile skihose snowboard” |
| Airbnb Flats at Amsterdam | ~28,000 | “surrounding area of the flat” | “areas neighborhoods vibrant”, “quiet peaceful yet”, “green calm safe”, “appartment”, “conveniently ideally” |
| | | “activities and sightseeing” | “sites touristic cycling”, “vondelpark foodhallen hallen”, “lake surrounding swimming”, “far biking”, “surrounding nature”, “amstel rembrandt” |
In this hackathon, we could show how powerfully and easily word embeddings can now be used to solve many industry problems that deal with preserving language context. For Visual Meta this is a big step, since we use text context in our classification algorithms, and Word2Vec gave comparable results when competing with our in-house dictionary.
It is not a silver bullet, but we can use it, for example, for dictionary enrichment or even for stacking classifiers that need specific features (e.g. inferring and applying gender filters to specific fashion categories).
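To give an idea of what dictionary enrichment could look like, the sketch below reuses the `w2v_model` fitted in the pipeline sketch above and asks for the nearest neighbours of a seed keyword; the seed word and similarity cut-off are illustrative assumptions.

```python
# Dictionary enrichment sketch: query the fitted Word2Vec model for the
# nearest neighbours of a seed keyword and keep the strongest candidates.
# The seed word "skihose" and the 0.6 threshold are assumptions.
seed_word = "skihose"
neighbours = w2v_model.findSynonyms(seed_word, 10)  # columns: word, similarity

enriched_terms = [row["word"] for row in neighbours.collect()
                  if row["similarity"] > 0.6]
print(f"Candidate additions for the '{seed_word}' context: {enriched_terms}")
```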