Pulling my hair out

I’m pissed off.

I’ve spent an insane amount of time struggling with an epic Elasticsearch bug because I failed to broaden my focus and consider the problem from a higher point of view.

This post is about my Elasticsearch bug and investigating issues in a complex environment. If you don’t know what Elasticsearch is, please jump to the conclusion.

I’m running a tiny Elasticsearch cluster on Amazon Web Services made of 1 routing node and 3 data nodes. Each node runs on an m3.large virtual machine with 2 cores and 7.5GB of RAM. The cluster holds about 600GB of data and has been running smoothly since early April.
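For reference, here is a minimal sketch of how such a role split is usually expressed in elasticsearch.yml on a 1.x-era cluster. The values are illustrative assumptions, not the exact configuration used here:

    # elasticsearch.yml on the routing node: master-eligible, holds no data
    node.master: true
    node.data: false

    # elasticsearch.yml on a data node: holds data, never becomes master
    node.master: false
    node.data: true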

2 weeks ago, Amazon had to reboot about 10% of EC2 to fix a Xen bug. This reboot operation included the routing node and one of the data nodes. Since I hate being woken up by monitoring alerts, all the machines run within EC2 Autoscaling Groups. Autoscaling Groups are great for upscaling or downscaling a platform. They’re even greater at replacing machines when they crash.

To avoid a service interruption, I upscaled the routing node Autoscaling Group to make sure I had a spare one. I was ready for Amazon’s mass reboot.
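Adding a spare machine to an Autoscaling Group is a one-liner with the AWS CLI; something like the call below does the trick (the group name is made up for the example, and is not necessarily how I did it at the time):

    # add a spare routing node by raising the desired capacity from 1 to 2
    aws autoscaling set-desired-capacity \
        --auto-scaling-group-name es-routing-asg \
        --desired-capacity 2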

The routing node acts as the cluster master node. On an Elasticsearch cluster, running 2 master nodes is OK. With the proper configuration, it even prevents a split brain when a network issue happens.
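The setting that provides that protection with 1.x zen discovery is minimum_master_nodes. A minimal sketch, assuming 2 master-eligible nodes:

    # elasticsearch.yml on the master-eligible nodes
    # with 2 master-eligible nodes, requiring 2 votes means an isolated node
    # can never elect itself master on its own
    discovery.zen.minimum_master_nodes: 2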

Everything happened as expected:

  • Amazon rebooted the main routing node.
  • The spare node did the job during the reboot time.
  • The usual master came back as expected.

Then, I downscaled my group to keep only one routing machine. That’s when my issues started.

Until that day, my routing node’s memory consumption had been completely flat. It suddenly started to grow linearly until the OOM killer killed the process. And again. And again. And again.

I started investigating, and investigating the wrong way.

My first assumption was:

Since this machine is the one that had been running for months, I must have lost a runtime setting I applied but never saved. Now I need to find out what that setting was.

It was my first mistake.

I decided the problem was on the routing node since the behavior only happened there. I narrowed my focus to that single machine based on a partial observation.

I launched a second routing node to see if it behaved the same way. It didn’t. Only the main master node had the memory issue.

It should have had the same problem, but I did not set up proper experimental conditions. The cluster configuration declares a single master node; I should have updated it to take the second one into account.

I then made 2 other assumptions:

  1. The virtual machine was corrupted, since another one built from the same AMI worked fine.
  2. Only the machine getting traffic from the clients had the memory issue.

I killed the apparently corrupted virtual machine and replaced it with another one. The problem remained the same.

That’s when I started to focus on Elasticsearch and the Java Virtual Machine’s memory allocation. I read an insane amount of docs about Java memory management. That’s the positive point: I now know more about the JVM, memory allocation and garbage collection than I ever expected to.

I did lots of tests. I tuned both my JVM and Elasticsearch configurations, looking for memory allocation issues. I changed the garbage collectors just in case. I downsized the minimum and maximum heap allocation. It took a long time: I had to wait a few hours to see how memory use was growing, and I had many other things to manage on the side. Note: run your Elasticsearch nodes with mlockall; you’ll spot memory issues much more quickly.
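For reference, the knobs behind that note and the heap tuning look roughly like this on a 1.x setup; the 4g value simply matches the heap figure mentioned just below, nothing more:

    # elasticsearch.yml
    bootstrap.mlockall: true    # lock the heap in RAM so it never gets swapped out

    # heap size is set outside elasticsearch.yml, typically via the environment:
    # ES_HEAP_SIZE=4g   (equivalent to -Xms4g -Xmx4g on the JVM command line)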

I was really upset because the process was not supposed to eat more than 4GB of RAM (plus more or less 200MB of non-heap allocation). My graphs showed the heap was not taking more than what had been allocated at runtime, and there was only something like 40MB of non-heap memory in use as well.

It was my second mistake. I focused on memory allocation because the process was eating lots of RAM. Spoiler: memory was the symptom, not the cause.

After reading the JVM documentation, I started to focus on non-heap memory. Java allocates a dedicated chunk of memory, the thread stack, for every single thread it creates. Before I read that, I had not looked at my thread count graphs. The number of active threads was insane: the JVM would create up to 20,000 concurrent threads before the machine ran out of memory, and none of them was ever closed.
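A quick way to check a JVM’s thread count, and the back-of-the-envelope math that shows why 20,000 threads kills a 7.5GB machine (the pgrep pattern below is an assumption; adjust it to your setup):

    # count the threads of the Elasticsearch JVM
    ps -o nlwp= -p "$(pgrep -f org.elasticsearch.bootstrap.Elasticsearch)"

    # each Java thread gets its own stack, commonly 256k to 1m (see -Xss);
    # at 20,000 threads, even 512k per stack is roughly 10GB of stack space
    # reserved outside the heap, far more than this 7.5GB machine can take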

I started to work with my colleague Han, who also knows Elasticsearch well. Han was the perfect investigation partner: he’s a smart developer, so he brought both a fresh look and a different point of view to the problem. Having a look at our centralized syslog server, he noticed strange messages sent by one of the data servers.

The data server was constantly sending auto discovery requests not only to the current master server but also to the old one.

As a consequence, the new master was also sending 1 auto discovery request per second to that node. Each of these requests created a new thread that was never closed, even after the request timed out. For every open thread, Elasticsearch was eating a bit more RAM. The data node’s Java process was stuck in an infinite loop; it was impossible to restart it gracefully and we had to kill it.
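To illustrate how a node can end up pinging a dead master (this is a made-up configuration, not necessarily what was running here): with zen unicast discovery, a host list that still contains the old master’s address is enough to keep the node trying a machine that no longer answers:

    # elasticsearch.yml on a data node -- IPs invented for the example
    discovery.zen.ping.multicast.enabled: false
    discovery.zen.ping.unicast.hosts: ["10.0.0.10", "10.0.0.11"]   # current master, old master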

I hadn’t looked at the syslog frontend because I had SSHed into the machine to look at the logs directly. Han does not have SSH access, so he had to check the frontend. Doing this, he got a global view of the cluster while I was only focusing on the ill node.

Conclusion

As every time I fuck up at something, I’ve learned some obvious lessons that I’m now sharing here.

Start your observation universe wide to see whether everything else is working correctly. An Elasticsearch cluster is a complex environment: it relies on many interconnected nodes, and the Java Virtual Machine itself is a complex thing.

Don’t assume the problem comes from a particular place just because it’s visible from there. Gather information from the whole environment even though you’re working on a specific part of it. I would have lost less time digging if I had looked at all the system data instead of focusing on memory alone.

Don’t wait to get a fresh look at your problem, especially if you’re working with people who’ll look at it from a completely different point of view. It helps a lot.

Everyone can make mistakes like this, starting with you. Stay humble and learn from your own and other people’s mistakes. Once you have, take a short break, stop looking at the past and focus on the next problem.
