Word embeddings on top of unstructured data to generate insights for marketing campaigns

By Claudio Villar, 11 months ago

Text processing has always attracted a lot of research and product attention. In recent years, many problems have used this source of data to generate knowledge and reduce human effort at multiple levels: language translation, artificial text generation, news clustering and, mainly for Visual Meta, classifying products from our catalog.

Many techniques have been designed to address these needs, the most common being “Bag of Words” (BoW). It is a simple approach: you take a defined dictionary of words (features), count how often each word appears, and optionally apply weights to some features. The output is a representation (feature vector) that can be fed into any machine-learning-based solution (classification, clustering, recommendation, etc.).
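To make the idea concrete, here is a minimal Bag of Words sketch in plain Python. The vocabulary, the weights and the example text are made up for illustration and do not come from our catalog.

```python
from collections import Counter

def bag_of_words(text, vocabulary, weights=None):
    """Return one (optionally weighted) count per vocabulary word."""
    weights = weights or {}
    # Naive whitespace tokenization; a real pipeline would clean the text first.
    counts = Counter(text.lower().split())
    return [counts[word] * weights.get(word, 1.0) for word in vocabulary]

# Hypothetical dictionary of features and a hypothetical weight for "sneaker".
vocabulary = ["red", "sneaker", "jeans", "t-shirt"]
vector = bag_of_words(
    "Red sneaker with red laces",
    vocabulary,
    weights={"sneaker": 2.0},
)
print(vector)
```

The resulting list has one entry per dictionary word, in dictionary order, and is what a classifier or clustering algorithm would receive as input.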


Visual Meta has the task of classifying more than 150 million items into approximately 20,000 product categories for more than 6,000 shops.

At this scale, we mainly have to deal with two issues: a) shops have different ways of providing product information; b) we need to maintain context between categories so that items are not misclassified. For the latter, imagine that a shop is promoting a “red sneaker” and the item description mentions that “the red sneaker fits well with a nice blue jeans and white t-shirt”. As you can see, just counting words is not enough, and we need to build a dictionary that represents each of our available categories.
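The sneaker example can be reproduced in a few lines. The sketch below counts dictionary hits per category; the three category dictionaries are hypothetical and far smaller than anything usable in practice.

```python
import re

# Hypothetical hand-made dictionaries, one per category.
category_dictionaries = {
    "sneakers": {"sneaker", "sneakers"},
    "jeans": {"jeans", "denim"},
    "t-shirts": {"t-shirt", "tee"},
}

def count_hits(description, dictionaries):
    """Count how many tokens of the description appear in each dictionary."""
    tokens = re.findall(r"[a-z-]+", description.lower())
    return {
        category: sum(token in words for token in tokens)
        for category, words in dictionaries.items()
    }

description = "The red sneaker fits well with a nice blue jeans and white t-shirt"
scores = count_hits(description, category_dictionaries)
print(scores)
```

All three categories receive the same score, so plain counting cannot tell that the item is a sneaker rather than a pair of jeans or a t-shirt.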

So far we have only talked about product classification, but we also need to advertise our product catalog through different marketing channels, and these often use keywords to trigger campaign ads. This use case can be framed as a clustering problem, in which we need to identify the keyword combinations (or clusters) that best represent a category to be advertised by our Marketing Managers.

And we still have the same issue: we again need humans to build such context dictionaries!

From that, we had the following questions to answer:

  • “Can we enrich our dictionaries in an unsupervised manner so as to increase the accuracy of our product classification?”
  • “Can we find word correlations on top of unstructured data and gain insights into how we can target marketing campaigns in a more automatic manner?”

We answered the first question in other experiments, described in another Tech Corner post. The second we answered at our last hackathon, and the solution is described below!


The first thing we needed for the experiments was a set of datasets, collected from our own product portfolio:

Country | Category  | Number of items
Brazil  | Furniture | 181,394
Germany | Sports    | 170,000
Germany | Wardrobes | 324
Sweden  | Furniture | 238,240

Since we’re also trying to generalize the approach so that it works in any kind of context, we also took a dataset from Airbnb containing flats in Amsterdam (source, dataset).

For the technology, we chose Apache Spark, which provides a good machine learning library (MLlib) and a good SQL-like interface for working with data in memory.

The steps are common to any data-driven solution, and the pipeline we used was no different.

  • Data cleaning
    • We basically removed language-specific stopwords and unnecessary characters, and tokenized the text. Apache Spark provides a good API that makes this easy, and you can also extend the dictionary if needed.
  • Embedding model
    • Here is where all the magic happens, and the magic is Word2Vec. It basically receives a set of tokens as input and outputs word embeddings: each word is represented by a feature vector that maintains linguistic context information. The internet is full of resources on Word2Vec, and various implementations are already available for easy use. Apache Spark has its own (https://spark.apache.org/docs/latest/mllib-feature-extraction.html#word2vec), which is the one we chose.
  • Clustering
    • Once we had the Word2Vec model in hand, we ran K-Means clustering and started playing with the k parameter until we found a value that made sense to us; in other words, a value for which the Marketing Manager could perceive the context of the words in each cluster and was happy with it.
  • Visualisation
    • For the visualisation, we used a simple Principal Component Analysis (PCA) on top of the features and plotted the clustering output using Highcharts (https://code.highcharts.com) for better interactivity. We used PCA, but there is a trend towards using t-SNE instead. In our case it didn’t change the results much; just keep in mind that there are other approaches to dimensionality reduction that can help when inspecting the data.
  • Boosting Marketing Campaigns
    • At this point, we have almost everything. We have the word embeddings that preserved the context of each dataset we fed into our model. We have the word clusters that visually show the potential keywords that could, in the end, be used by our Marketing team to boost our keyword inventory. Here are some results. Note that the “context” column refers to our own understanding of a keyword cluster.
Dataset | Keywords generated | Context | Keywords (extract)
Sports – Germany | ~50,000 | “soccer” | “fußballzubehör vereinsmannschaften funktionsshirt”, “kappa dortmund”, “oberteile sportausrüstung”, “trikot stutzen fussball-fan-shop”
Sports – Germany | ~50,000 | “winter sport” | “skihose lange”, “skibindungen”, “snow boots snowboardschuhe”, “snowboardartikel arbor jones”, “unterteile skihose snowboard”
Airbnb flats in Amsterdam | ~28,000 | “surrounding area of the flat” | “areas neighborhoods vibrant”, “quiet peaceful yet”, “green calm safe”, “appartment”, “conveniently ideally”
Airbnb flats in Amsterdam | ~28,000 | “activities and sightseeing” | “sites touristic cycling”, “vondelpark foodhallen hallen”, “lake surrounding swimming”, “far biking”, “surrounding nature”, “amstel rembrandt”
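The cleaning and clustering steps of the pipeline above can be sketched without Spark as follows. This is a minimal, library-free illustration, not the code we ran at the hackathon: the stopword list, the two-dimensional vectors standing in for Word2Vec embeddings, and the starting centroids are all made-up assumptions.

```python
import math
import re

# Illustrative stopwords only; a real list is language specific and much longer.
STOPWORDS = {"the", "a", "and", "with", "in", "of"}

def clean(text):
    """Lowercase, drop non-letter characters, tokenize and remove stopwords."""
    tokens = re.findall(r"[a-zäöüß]+", text.lower())
    return [token for token in tokens if token not in STOPWORDS]

def kmeans(points, centroids, iterations=10):
    """Plain K-Means (Lloyd's algorithm) with caller-supplied start centroids."""
    centroids = [list(c) for c in centroids]  # do not mutate the caller's list
    assignments = []
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        assignments = [
            min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(len(centroids)):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return assignments, centroids

tokens = clean("Trikot and Stutzen with the Skihose")

# Hypothetical 2-D vectors standing in for trained Word2Vec embeddings.
embeddings = {
    "trikot": [0.9, 0.1],
    "stutzen": [0.8, 0.2],
    "skihose": [0.1, 0.9],
    "skibindungen": [0.2, 0.8],
}
words = list(embeddings)
assignments, _ = kmeans(
    [embeddings[w] for w in words],
    centroids=[[1.0, 0.0], [0.0, 1.0]],  # k = 2, fixed start for repeatability
)

# Group the words by cluster id, as a marketing manager would review them.
clusters = {}
for word, cluster_id in zip(words, assignments):
    clusters.setdefault(cluster_id, []).append(word)
print(tokens, clusters)
```

With real embeddings the vectors have many more dimensions, and k has to be tuned by inspecting the clusters, as described in the clustering step.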


In this hackathon, we showed how powerfully and easily word embeddings can now be used to solve many industry problems that involve preserving language context. For Visual Meta this is a big step, since we use text context in our classification algorithms, and Word2Vec gave results comparable to those of our in-house dictionary.

It’s not a silver bullet, but we can use it, for example, for dictionary enrichment or even for stacking classifiers that need specific features (e.g. inferring and applying gender filters to specific fashion categories).


  • Manuel Jain (Software Engineering Team Lead)
  • Mykola Karaman (Software Engineer)
  • Claudio Villar (Senior Product Manager)
  • Malin Levin (Marketing Manager)
  • Beata Bednarksi (Marketing Team Lead)
  • Sebastian Grebasch (Marketing Director)
  • Ralph Ward (Quality Manager)



Claudio Villar works as Senior Product Manager for the Machine Learning group at Visual Meta. He holds an MBA in Project Management and has more than 10 years of experience in research and product development.