When we started running our platform ladenzeile.de in Germany in 2009, things were pretty nice and cozy: it was the first country we launched, with little traffic, a small number of products, and not much actually being done with the data. We were able to run the whole platform on one single server, which did not even have much power compared to the servers commonly used today.
But things didn’t stay that way. As the number of products on the platform grew, backend processes such as importing products from shops and sorting them into categories needed more and more servers to process all the data.
In Germany, we increased our number of products from one million to 23 million within three years.
But not only did individual countries scale; Visual Meta also expanded into many other countries. In the beginning these were mostly in Europe, but recently countries such as Brazil and India were added to our portfolio. We have grown from one country to 19 within the last five years.
With an increasing demand for computing capacity per country, and with more and more countries, the number of servers to maintain became too hard to handle. At that time, we were load-balancing our servers manually: each time we launched a country, we had to allocate server resources to that country specifically. The same was true for processes per country. For example, if a country added more shops, we had to allocate another server in order to fetch and synchronize their products in a timely fashion; keeping product information up to date is crucial for us. This led to a tedious process of checking server loads and occasionally rebalancing them. At some point in 2013, we decided that this was no longer manageable with our steadily increasing number of servers.
We therefore started looking into cluster solutions which would handle this for us.
Specifically our goals were:
- Run applications automatically from a central location
- Balance load automatically
- Servers can be added seamlessly
- Servers can die without (major) impact
- Unlimited disk space!
There are two technologies from the Hadoop world that helped us here:
- HDFS is Hadoop’s distributed filesystem. It is very reliable and very easy to use. Data is distributed and replicated over different servers, and it is rack-aware, meaning that data will be placed in different racks if possible. There are no hardware constraints: servers do not have to be the same size or model, which makes it very easy to incorporate different generations of hardware. We are currently running a cluster with approximately 100TB of capacity.
A nice feature is the replication factor, which can be set per file; we sometimes exploit this to distribute read-heavy files over a large number of data nodes.
Since starting to use HDFS in our codebase, we have basically eliminated all local filesystem interaction in order to get around local disk space limitations.
- YARN is Hadoop’s “Yet Another Resource Negotiator”. It dispatches jobs into the cluster while paying attention to load-balancing strategies. Adding servers to the cluster is as easy as it can be: copy an existing configuration to the new machine and fire up the NodeManager, which supervises local resources. It will announce itself to the central ResourceManager and immediately take over jobs.
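To illustrate the rack-awareness and per-file replication ideas described above, here is a toy sketch of how a placement policy can spread a file’s replicas over distinct racks. All class and method names here are our own for illustration; this is not Hadoop’s actual placement code.

```java
import java.util.*;

// Toy sketch of rack-aware replica placement: given a per-file
// replication factor, prefer putting each replica on a different rack.
// Hypothetical names; NOT Hadoop's real implementation.
public class RackAwarePlacement {

    // node name -> rack name
    private final Map<String, String> rackOf;

    public RackAwarePlacement(Map<String, String> rackOf) {
        this.rackOf = rackOf;
    }

    /** Pick up to `replication` nodes, spreading them over distinct racks first. */
    public List<String> place(int replication) {
        List<String> chosen = new ArrayList<>();
        Set<String> usedRacks = new HashSet<>();
        // First pass: at most one replica per rack.
        for (Map.Entry<String, String> e : rackOf.entrySet()) {
            if (chosen.size() == replication) break;
            if (usedRacks.add(e.getValue())) chosen.add(e.getKey());
        }
        // Second pass: if the replication factor exceeds the number of
        // racks, fill up with remaining nodes regardless of rack.
        for (String node : rackOf.keySet()) {
            if (chosen.size() == replication) break;
            if (!chosen.contains(node)) chosen.add(node);
        }
        return chosen;
    }
}
```

With three racks and a replication factor of three, each replica lands in a different rack; a read-heavy file with a higher per-file factor simply pulls in additional nodes.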
With these two technologies alone, we were able to solve some of the major scaling issues that we had at the time, and we have been able to grow our cluster faster and more easily than we expected when we made the decision.
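The “add a server and it immediately takes over jobs” behavior can also be sketched with a toy scheduler: a central manager tracks the running-job count per registered node and always dispatches the next job to the least-loaded one, so a freshly registered node starts receiving work right away. Again, the names below are hypothetical and nothing here is the actual YARN API.

```java
import java.util.*;

// Toy sketch of YARN-style dispatching: the central manager assigns
// each new job to the least-loaded registered node.
// Hypothetical names; NOT the real YARN API.
public class ToyResourceManager {

    private final Map<String, Integer> runningJobs = new HashMap<>();

    /** A node announces itself; it is immediately eligible for jobs. */
    public void registerNode(String node) {
        runningJobs.putIfAbsent(node, 0);
    }

    /** Dispatch a job to the least-loaded node and return its name. */
    public String dispatch() {
        String best = null;
        for (Map.Entry<String, Integer> e : runningJobs.entrySet()) {
            if (best == null || e.getValue() < runningJobs.get(best)) best = e.getKey();
        }
        if (best == null) throw new IllegalStateException("no nodes registered");
        runningJobs.merge(best, 1, Integer::sum);
        return best;
    }

    /** A job completed; free up capacity on its node. */
    public void jobFinished(String node) {
        runningJobs.merge(node, -1, Integer::sum);
    }
}
```

Because dispatching always picks the emptiest node, a newly registered (and therefore idle) node is the first candidate for the next job, which is exactly the seamless scale-out behavior we were after.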
We are constantly looking into more technologies from the Hadoop ecosystem which we could use for our platform (and have for example integrated HBase as a major distributed data store).
As a conclusion for this article:
- Hadoop is awesome
- If you expect your platform to scale, definitely check it out.
- Initial setup can be tricky, but everything runs smoothly once it is done