You know, for search

Elasticsearch is one of my favorite piece of software. I’ve been using it since 0.11 and deployed every version since 0.17.6 in production. However, I must admit it’s sometimes a pain in the ass to manage. It can behave unexpectedly and either vomit gigabytes in your logs, or stay desperately silent.

One of those strange behaviour can happen after one or more data nodes are restarted. In a cluster running either lots of nodes or lots of shards, the post restart shard allocation can take forever and never end.

This post is about investigating and eventually fixing this behaviour. You don’t need to have a deep knowledge of Elasticsearch to understand it.

Red is dead

The first evidence something’s wrong comes from the usual cluster state query.

curl -XGET http://localhost:9200/_cluster/health?pretty
{
  "cluster_name" : "myescluster",
  "status" : "red",
  "timed_out" : false,
  "number_of_nodes" : 20,
  "number_of_data_nodes" : 16,
  "active_primary_shards" : 2558,
  "active_shards" : 5628,
  "relocating_shards" : 0,
  "initializing_shards" : 4,
  "unassigned_shards" : 22
}  

This is all but good.

The status: red is a sign your cluster is missing some primary shards. It means some data are still completely missing. As a consequence, queries on these data will fail and indexing will take a tremendous amount of time. When all the primary shards are back, the cluster switches in yellow to warn you it’s still working but your data is present.

initializing_shards: 4 is what your cluster is currently doing: bringing back your data to life.

unassigned_shards: 22 is where your lost primary shards are. The more you have there, the more data you’re missing and the more conflict you’re likely to meet.

Baby come back, won’t you pleaaaase come back?

What happened there?

When a data node leaves the cluster and comes back, Elasticsearch will bring the data back and merge the records that may have been written during the time that node was away. Since there may be lots of new data, the process can take forever, even more when some shards fail at starting.

Let’s run another query to understand the cluster state a bit better.

curl -XGET http://localhost:9200/_cat/shards
t37       434 p STARTED 12221982  13.8gb 10.0.0.22   datanode02
t37       434 r STARTED 12221982  13.8gb 10.0.0.23   datanode03
t37       403 p INITIALIZING 21620252  28.3gb 10.0.0.22   datanode02
t37       404 p INITIALIZING 5720596    4.9gb 10.0.0.22   datanode02
t37       20  p UNASSIGNED
t37       20  r UNASSIGNED
t37       468 p INITIALIZING  8313898  12.3gb 10.0.0.22   datanode02
t37       470 p INITIALIZING  38868416 56.8gb 10.0.0.22   datanode02
t37       37  p UNASSIGNED
t37       37  r UNASSIGNED
t37       430 r STARTED 11806144  15.8gb 10.0.0.24   datanode04
t37       430 p STARTED 11806144  15.8gb 10.0.0.25   datanode05
t37       368 p STARTED 34530372    43gb 10.0.0.25   datanode05
t37       368 r STARTED 34530372    43gb 10.0.0.23   datanode03
...

This is indeed a sample of the real output, which is actually 5628 lines long. There’s a few interesting things here I want to show you.

Every line is under the same form.

The first field, t37, is the index name. This one is obviously called t37, and it has a huge number of primary shard to store the gazillon posts I’ve written over the years.

The second field is the shard number, followed by either p if the shard is a primary one, or r if it’s a replica.

The fourth field is the shard state. It’s either UNASSIGNED, INITIALIZING or STARTED. The first 2 states are the ones we’re insterested with. If I run the previous query | grep INITIALIZING and | grep UNASSIGNED, I’ll get 4 and 22 lines respectively.

If you need a fix…

First, let’s take care of the 22 unassigned shards since they’re the easiest to fix. What we’re going to do is force the shards allocation. Since we’re lazy, let’s do some scripting here.

for shard in $(curl -XGET http://localhost:9200/_cat/shards | grep UNASSIGNED | awk '{print $2}'); do
    curl -XPOST 'localhost:9200/_cluster/reroute' -d '{
        "commands" : [ {
              "allocate" : {
                  "index" : "t37", 
                  "shard" : $shard, 
                  "node" : "datanode15", 
                  "allow_primary" : true
              }
            }
        ]
    }'
    sleep 5
done

What we’re doing here is forcing every unassigned shard allocation on datanode15. Depending on the shards size, you’ll probably have to assign them in various nodes. If you’re playing with very small shards, don’t worry, Elasticsearch will reallocate them for you once they’re up.

Let’s run the cluster health query again, will you?

curl -XGET http://localhost:9200/_cluster/health?pretty
{
  "cluster_name" : "myescluster",
  "status" : "red",
  "timed_out" : false,
  "number_of_nodes" : 20,
  "number_of_data_nodes" : 16,
  "active_primary_shards" : 2568,
  "active_shards" : 5638,
  "relocating_shards" : 0,
  "initializing_shards" : 16,
  "unassigned_shards" : 0
}  

Great news, isn’t it? All our previously unassigned shards are now active or assigned. Unfortunately, things aren’t fixed yet.

I still haven’t found what I’m looking for

Let’s run that query once more a few minutes later.

curl -XGET http://localhost:9200/_cluster/health?pretty
{
  "cluster_name" : "myescluster",
  "status" : "red",
  "timed_out" : false,
  "number_of_nodes" : 20,
  "number_of_data_nodes" : 16,
  "active_primary_shards" : 2580,
  "active_shards" : 5650,
  "relocating_shards" : 0,
  "initializing_shards" : 4,
  "unassigned_shards" : 0
}  

See? There are 4 shards still initializing. Guess what? I’m sure they’re the ones the first query already spotted. Let’s query the shard listing again.

curl -XGET http://localhost:9200/_cat/shards | grep INIT
t37       403 p INITIALIZING 21620252  28.3gb 10.0.0.22   datanode02
t37       404 p INITIALIZING 5720596    4.9gb 10.0.0.22   datanode02
t37       468 p INITIALIZING  8313898  12.3gb 10.0.0.22   datanode02
t37       470 p INITIALIZING  38868416 56.8gb 10.0.0.22   datanode02

Interesting isn’t it? Every stuck shard are on the same node. This is not really a surprise though.

If we try to run the reroute query, it will fail because the primary shard is not yet active.

[2015-01-25 11:45:14,569][DEBUG][action.admin.cluster.reroute] [masternode01] failed to perform [cluster_reroute (api)]
org.elasticsearch.ElasticsearchIllegalArgumentException: [allocate] allocation of [t37][403] on node [datanode02][uvM8ncAjTs2_HER9k6DMow][datanode02.t37.net][inet[/10.0.0.22:9300]]{master=false} is not allowed, reason: [YES(shard is not allocated to same node or host)][YES(node passes include/exclude/require filters)][NO(primary shard is not yet active)][YES(below shard recovery limit of [2])][YES(allocation disabling is ignored)][YES(allocation disabling is ignored)][YES(no allocation awareness enabled)][YES(total shard limit disabled: [-1] <= 0)][YES(no active primary shard yet)][YES(disk usages unavailable)][YES(shard not primary or relocation disabled)]
  at org.elasticsearch.cluster.routing.allocation.command.AllocateAllocationCommand.execute(AllocateAllocationCommand.java:221)
  at org.elasticsearch.cluster.routing.allocation.command.AllocationCommands.execute(AllocationCommands.java:119)
  at org.elasticsearch.cluster.routing.allocation.AllocationService.reroute(AllocationService.java:131)
  at org.elasticsearch.action.admin.cluster.reroute.TransportClusterRerouteAction$1.execute(TransportClusterRerouteAction.java:91)
  at org.elasticsearch.cluster.service.InternalClusterService$UpdateTask.run(InternalClusterService.java:328)
  at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:153)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
  at java.lang.Thread.run(Thread.java:745)

Then, what can we do? Unfortunately not much except restarting the node. Hopefully there’s only one of them.

Tell me whyyyyyyyyy

You and me together, fighting for our cluster

If you’re asking, yes, I love Jimmy Sommerville, but that’s not the point.

There’s some literature about shards never initializing and cluster restart taking ages on the Web. From what I’ve understood, such a stuck state can be reached when some shards fail to start and nothing’s done about it. Unfortunately, I didn’t find anything in my logs related to the shards that were stuck, so I have no idea how it happened.

That’s all folks, this post about tracking and fixing your shard allocations issues is over. I hope you enjoyed reading it as much as I enjoyed spending my weekend trying to fix it. Next time, we’ll see how pink unicorns are your personal development worst ennemies. Stay tuned!

Perry the Platypus wants you to subscribe now! Even if you don't visit my site on a regular basis, you can get the latest posts delivered to you for free via Email: