Redesigning our monitoring system using Kafka, InfluxDB, and Grafana

By Nick Rammos

At Visual Meta, the need to monitor our applications in real time was recognized quite early. However, as these things often go in a rapidly growing company, the initial version of our monitoring system was not so much designed as grown organically. Down the road it became somewhat problematic both for the users and the maintainers of the system, and eventually a redesign was requested, authorized, and set under way.

Old setup

The first version of the system essentially consisted of writing monitoring data to an Elasticsearch database, which were then displayed in various Grafana boards. A drawback of this setup was that Elasticsearch is not ideally suited to application monitoring, which caused some performance problems – especially under heavy loads.

The actual implementation consisted of a REST API which would accept monitoring events and write them to Elasticsearch. A Java client for the API was provided for the engineering teams to use. The issue here was that whenever someone wanted to write a new type of event to Elasticsearch, they would have to implement a Java class for the event and also add a new controller to the API. Understandably, this created some overhead which we were keen to remove.

Redesign

The initial idea for the new design came from one of our awesome Hackathon projects using InfluxDB. As a time-series database, InfluxDB seemed like a natural fit for storing monitoring data points and a new InfluxDB instance was set up.

Since we wanted to retain historical data for some time but at the same time minimize disk space usage, we decided to use two separate retention policies. The original data would be retained for one month, after which it would be downsampled (i.e. averaged per hour) and further retained for one year. Fortunately InfluxDB allowed us to do this automatically for each measurement in the database, using the following query:
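The exact query is not reproduced in this post; a representative setup, assuming a database called "monitoring" and retention policies named "one_month" and "one_year", would look roughly like this, using a continuous query to downsample every measurement to hourly means:

    CREATE RETENTION POLICY "one_month" ON "monitoring" DURATION 30d REPLICATION 1 DEFAULT
    CREATE RETENTION POLICY "one_year" ON "monitoring" DURATION 52w REPLICATION 1

    CREATE CONTINUOUS QUERY "downsample_hourly" ON "monitoring"
    BEGIN
      SELECT mean(*) INTO "monitoring"."one_year".:MEASUREMENT FROM /.*/ GROUP BY time(1h), *
    END

The backreference syntax (:MEASUREMENT together with the /.*/ regex) is what lets InfluxDB apply the downsampling to every measurement in the database automatically, rather than requiring one query per measurement.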

The next question was how we would handle possible delays or downtime of the service writing to the database. A message queue with some retention was a fairly obvious answer (and would replace the REST API), and in this case we chose Apache Kafka as it seemed to tick all the right boxes: redundancy, reliability, and multiple consumers per message (more on that later). A cluster with 3 brokers was set up, and the retention policy was set to one hour.
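For reference, a topic with a one-hour retention and a replication factor of 3 can be created programmatically along these lines; the broker addresses, topic name, and partition count below are placeholders for illustration, not our actual values:

    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.AdminClientConfig;
    import org.apache.kafka.clients.admin.NewTopic;

    import java.util.List;
    import java.util.Map;
    import java.util.Properties;

    public class CreateMonitoringTopic {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            // Assumed broker addresses, for illustration only
            props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG,
                    "kafka-1:9092,kafka-2:9092,kafka-3:9092");

            try (AdminClient admin = AdminClient.create(props)) {
                // Hypothetical topic name; replicated across all 3 brokers,
                // with messages retained for one hour (3,600,000 ms)
                NewTopic topic = new NewTopic("monitoring-events", 6, (short) 3)
                        .configs(Map.of("retention.ms", "3600000"));
                admin.createTopics(List.of(topic)).all().get();
            }
        }
    }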

The plan

With all the infrastructure sorted out, it was time to plan the upgrade. The first obstacle was that until the Grafana boards were switched over to use the new InfluxDB data source, monitoring data would still have to be written to Elasticsearch as well. This is where Kafka’s consumer groups came in handy: one group would write the events to InfluxDB and a second one would write the same events to Elasticsearch.
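The sketch below shows how the two consumers differ only in their group id: each group receives its own copy of every event, so the same stream feeds both databases. Topic name, group names, and broker address are assumptions:

    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.serialization.StringDeserializer;

    import java.util.List;
    import java.util.Properties;

    public class MonitoringEventConsumers {

        // Run once with "influxdb-writer" and once with "elasticsearch-writer"
        // (hypothetical group names) to write the same events to both stores.
        static KafkaConsumer<String, String> createConsumer(String groupId) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "kafka-1:9092");   // assumed address
            props.put("group.id", groupId);
            props.put("key.deserializer", StringDeserializer.class.getName());
            props.put("value.deserializer", StringDeserializer.class.getName());

            KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
            consumer.subscribe(List.of("monitoring-events")); // assumed topic name
            return consumer;
        }
    }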

The next roadblock was that, until teams updated their Java client versions, they would still be trying to call the old REST API. A few legacy callers were not even using the Java client at all but calling the API directly, which meant the old endpoint would also need to be supported for some time. We decided to update the REST controller to write straight back to Kafka using the updated Java client; this way we would have a single point of entry. We also planned to implement an e-mail-based alerting system at a later date to notify us of usages of the REST endpoint, so that we would know when it was safe to remove the endpoint entirely.
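A minimal sketch of such a pass-through controller, assuming Spring MVC and a hypothetical MonitoringClient wrapping the Kafka producer (a sketch of that client follows in the next section), could look like this; the endpoint path and class names are placeholders:

    import org.springframework.http.ResponseEntity;
    import org.springframework.web.bind.annotation.PostMapping;
    import org.springframework.web.bind.annotation.RequestBody;
    import org.springframework.web.bind.annotation.RestController;

    @RestController
    public class LegacyMonitoringController {

        private final MonitoringClient client;  // hypothetical Kafka-backed client

        public LegacyMonitoringController(MonitoringClient client) {
            this.client = client;
        }

        // Legacy callers keep POSTing here; the controller simply forwards the
        // event to Kafka through the same client everyone else uses.
        @PostMapping("/events")
        public ResponseEntity<Void> report(@RequestBody MonitoringEvent event) {
            client.report(event);
            return ResponseEntity.accepted().build();
        }
    }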

The new infrastructure finally looked like this: applications send monitoring events to Kafka through the updated Java client (or, for legacy callers, through the old REST endpoint), two consumer groups write the same events to InfluxDB and Elasticsearch, and the Grafana boards read from those databases.

The redesign also allowed us to simplify the Java client significantly. Its entire code now essentially consisted of some very simple validation and sending events to Kafka asynchronously:
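The snippet itself is not reproduced in this post; a minimal sketch of the idea, with the topic name and the MonitoringEvent class standing in for the real ones, could look like this:

    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.LinkedBlockingQueue;
    import java.util.concurrent.ThreadPoolExecutor;
    import java.util.concurrent.TimeUnit;

    public class MonitoringClient {

        private final KafkaProducer<String, String> producer;

        // Bounded work queue: keep at most the 1000 most recent events and
        // discard the oldest ones if Kafka is slow or unavailable.
        private final ExecutorService executor = new ThreadPoolExecutor(
                1, 1, 0L, TimeUnit.MILLISECONDS,
                new LinkedBlockingQueue<>(1000),
                new ThreadPoolExecutor.DiscardOldestPolicy());

        public MonitoringClient(KafkaProducer<String, String> producer) {
            this.producer = producer;
        }

        public void report(MonitoringEvent event) {
            try {
                // Very simple validation, then hand off to the executor so the
                // caller never blocks on Kafka.
                if (event == null || event.getMeasurement() == null) {
                    return;
                }
                executor.submit(() -> producer.send(
                        new ProducerRecord<>("monitoring-events", event.toLineProtocol())));
            } catch (Exception e) {
                // Monitoring must never break the application, so only log
                // (logger omitted for brevity).
            }
        }
    }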

In practice

After the rollout, the teams were able to start taking advantage of the latest and greatest features – and it turned out to be a breeze. Instead of implementing their own event classes and endpoints, they could now just create an event, add monitoring values to it, and the Mission Control client would take it from there.

For instance, an application which executes certain updates per country and wants to monitor how long they take simply does the following:
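Roughly speaking, and with all class and method names below standing in for the real client API rather than quoting it:

    // Hypothetical usage; names are placeholders, not the actual client API.
    long start = System.currentTimeMillis();
    runUpdatesForCountry("de");                       // the work being monitored
    long durationMs = System.currentTimeMillis() - start;

    MonitoringEvent event = new MonitoringEvent("country_updates") // measurement name
            .withTeam("content-team")                              // team name, see below
            .withTag("country", "de")
            .withValue("duration_ms", durationMs);

    monitoringClient.report(event);                   // never throws, never blocks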

We even implemented monitoring for the monitoring system! The team name shown in the snippet above is used by the service writing the monitoring data to InfluxDB. For each monitoring event we actually write two points: one is the actual event, and the other is metadata for the event, including the team name. This gives us an easy way of monitoring which teams and/or applications incur the highest loads on the system, and at which times.
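In InfluxDB line-protocol terms, the two points written for the example above might look roughly like this (measurement, tag, and field names are again assumptions):

    country_updates,country=de duration_ms=1274 1721901600000000000
    monitoring_meta,team=content-team,measurement=country_updates events=1 1721901600000000000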

It is also important to note a couple of design decisions we made for the benefit of the users of the client. The basic reasoning was that monitoring, or any issues related to it, should never interrupt or otherwise affect the applications’ operation. As such, the report() method in the client should:

  1. Never throw an exception: we handled this simply by wrapping it in a try-catch and logging any errors.
  2. Never block: as shown in the previous section, for this we used an ExecutorService which would submit any events asynchronously in case Kafka was slow. In the interest of not blowing up the applications’ memory usage, we used a work queue with a max size of 1000 and set it to discard the oldest items once it was full. This way we would retain the 1000 most recent events even if the Kafka cluster was unavailable for a long period, ensuring that no data were lost in the vast majority of practical cases.

We also took special care to close the client once an application was done with it. First, the ExecutorService would be asked to shut down within 20 seconds, which allowed enough time for all remaining events to be processed and handed to the producer. Afterwards we would wait for a maximum of another 20 seconds until all pending events were sent to the cluster, then disconnect.
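Continuing the assumed names from the client sketch above, the shutdown sequence could be expressed roughly as follows:

    // Part of the sketched MonitoringClient; requires java.time.Duration.
    public void close() {
        executor.shutdown();
        try {
            // Give queued events up to 20 seconds to be handed to the producer.
            if (!executor.awaitTermination(20, TimeUnit.SECONDS)) {
                // Any events still queued at this point will be dropped.
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        // Then up to another 20 seconds for the producer to flush to the cluster.
        producer.close(Duration.ofSeconds(20));
    }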

Finally, the client was designed to be completely thread-safe, so that a single instance could be reused throughout an application; in fact the client performs best when used this way, and no additional parallelism should be implemented on top of it. This also made it very easy for applications to define a Spring bean for the client and inject it wherever needed.
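For example, a single shared instance can be defined as a bean along these lines (the configuration values are placeholders):

    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.common.serialization.StringSerializer;
    import org.springframework.context.annotation.Bean;
    import org.springframework.context.annotation.Configuration;

    import java.util.Properties;

    @Configuration
    public class MonitoringConfig {

        // One thread-safe client instance shared by the whole application;
        // Spring calls close() on shutdown so pending events are flushed.
        @Bean(destroyMethod = "close")
        public MonitoringClient monitoringClient() {
            Properties props = new Properties();
            props.put("bootstrap.servers", "kafka-1:9092");   // assumed address
            props.put("key.serializer", StringSerializer.class.getName());
            props.put("value.serializer", StringSerializer.class.getName());
            return new MonitoringClient(new KafkaProducer<>(props));
        }
    }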

An example of the end result can be seen below – a Grafana board used to monitor one of our internal APIs, mainly tracking traffic and response times:
