
What engineering did in the June maintenance weekend

Hi All,

I thought I would post a quick message about what happened over the weekend of June 17-18. We sent our customers the standard notification that InfoMap would be unavailable over the weekend for scheduled maintenance. When I receive notices like that from other companies, though, I am always left wondering what they actually did and why they had to go as far as taking the product offline.

Background

InfoMap uses Elasticsearch behind the scenes to index all the data we take in. We don’t create just one index per customer, though; we have many indexes per customer. Each serves a different purpose and is optimised to do its particular job as well as possible. This is the model we have used for years, since initial development, and it has worked well for us. However, a few months ago we added the item type exploration and customisation features to the product. At the time, we knew this would drastically increase the load on our clusters, so we provisioned more resources to handle it. As some of you may have noticed, this didn't work out as expected: InfoMap became slow and errors became more and more frequent. This served as a shameful wake-up call that we were doing something wrong, so we looked for a better solution.

The plan

There were three parts to the solution: upgrade the Elasticsearch version; refactor our Elasticsearch code to optimise it further; and review the Elasticsearch sharding configuration.

We were upgrading from ES 1.x to ES 5.x, and the code we were using for 1.x just wouldn't work with ES 5. That sounds like a big jump, but ES skipped versions 3 and 4, so it was only two major versions. That’s still enough for the two versions not to play nicely with each other, which ruled out an in-place upgrade. Instead, we created an entirely new cluster and wrote a utility to migrate the data from the old cluster to the new one. This is the main reason we asked for a maintenance window: we didn't want to be migrating data while it was still being written to. I’m happy to report that the migration was a complete success and the new cluster is working well. Hopefully you will see speed improvements when you log into the map.
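
For anyone curious about the general shape of a migration like this, here is a minimal sketch in Python. It is illustrative only and not the actual utility: `read_all`, `write_batch` and the two in-memory lists are hypothetical stand-ins for reading from the old cluster and bulk-writing to the new one.

```python
from typing import Callable, Dict, Iterable, Iterator, List

Document = Dict[str, object]


def batched(docs: Iterable[Document], size: int) -> Iterator[List[Document]]:
    """Yield lists of at most `size` documents from `docs`."""
    batch: List[Document] = []
    for doc in docs:
        batch.append(doc)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        # Final, possibly smaller, batch.
        yield batch


def migrate(read_all: Callable[[], Iterable[Document]],
            write_batch: Callable[[List[Document]], None],
            batch_size: int = 500) -> int:
    """Copy every document from the source to the destination; return the count."""
    copied = 0
    for batch in batched(read_all(), batch_size):
        write_batch(batch)
        copied += len(batch)
    return copied


# In-memory stand-ins for the two clusters (hypothetical, for illustration only).
old_cluster: List[Document] = [{"id": i, "name": f"item-{i}"} for i in range(1234)]
new_cluster: List[Document] = []

total = migrate(lambda: iter(old_cluster), new_cluster.extend, batch_size=500)
```

A single pass like this is only complete if nothing is added to the source while the copy runs, which is why a write-free window makes the job so much simpler.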

Part 2 goes hand in hand with part 1. Code written to talk to ES 1 simply does not work with ES 5, so we refactored all our services to use ES 5 and, where possible, optimised queries and connection handling. There is still work to be done on the optimisation front, but we are getting there, and you should see incremental improvements over the coming weeks.

Part 3 is where we are now. Even as I type this, tweaks are being made to both the veritas service settings and the Elasticsearch configuration to squeeze every bit of performance we can out of them.

Lessons learnt

This work should improve the performance not only of the map but also of our own backend processing systems. It is by no means a magic bullet that fixes every problem, but it is a big step in the right direction. In taking that step we have also learnt a lot of lessons to take forward. For one thing, we don’t intend to wait until we are forced to upgrade our systems; we will be more rigorous about keeping all the tech running smoothly. We also learnt that we need to improve our monitoring so that it highlights when things are struggling, not just when they have failed.
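
To make the "struggling versus failed" distinction concrete, here is a minimal sketch in Python. It is not our monitoring system: the `classify` function, the 1000 ms warning threshold and the 50% failure ratio are all assumptions picked purely for illustration.

```python
from statistics import median
from typing import List, Optional


def classify(latencies_ms: List[Optional[float]],
             warn_ms: float = 1000.0,
             max_failure_ratio: float = 0.5) -> str:
    """Classify recent requests as "ok", "struggling" or "failed".

    Each entry is a response time in milliseconds, or None if the request failed.
    The thresholds are hypothetical defaults, not real production values.
    """
    if not latencies_ms:
        return "ok"  # nothing to judge yet

    failures = sum(1 for value in latencies_ms if value is None)
    if failures / len(latencies_ms) >= max_failure_ratio:
        return "failed"

    successes = [value for value in latencies_ms if value is not None]
    # Some failures, or slow typical responses, count as struggling.
    if failures > 0 or median(successes) > warn_ms:
        return "struggling"
    return "ok"
```

The point of the middle state is that an alert can fire while requests are still succeeding but getting slow, rather than only once most of them fail.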

Here in engineering we feel the pain customers do (although maybe in different ways). When systems don’t respond in a timely manner, the on-call engineer gets a call from the monitoring system. Having just finished a week on call, I promise you I feel it when I am called at 1:30am about the site being slow or a server maxing out its resources. After this weekend I’m looking forward to a quiet week the next time I am on call.

Adam