Air balloon

If you’re following Ansible development, you’ve certainly noticed the new replace_all_instances directive in their Autoscaling Group module. If you haven’t yet, you’re definitely missing the most exciting thing for immutable deployment since immutable deployment itself.

Ansible is a platform orchestrator that has been taking a strong cloud orientation since release 1.6. It’s made to run commands on a group of machines called an inventory. A set of commands is called a role, and a group of roles can be gathered into a playbook. Roles are written in YAML, and they’re mostly a description of what you want Ansible to do.
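
To make those terms concrete, here is a minimal sketch of a playbook and a role task file; the webservers group and the nginx role are hypothetical names, not taken from our setup:

# site.yml: a playbook applying roles to the webservers group from the inventory
- hosts: webservers
  roles:
    - nginx          # a role: a reusable set of tasks
    - deploy_app

# roles/nginx/tasks/main.yml: the tasks making up the nginx role
- name: install nginx
  apt: name=nginx state=present

- name: make sure nginx is running
  service: name=nginx state=started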

At Botify, we’re using Ansible for many things.

We describe machine deployment processes in playbooks to automate them. We run parallel commands on subsets of the platform. And we manage the whole EC2 orchestration. This is where replace_all_instances comes in.

Immutable deployment is a paradigm born with virtualization. Instead of building and upgrading physical machines, you build a master image that you spawn as many times as needed. When you need to update a machine, you build a new image and trash the existing one. Immutable deployment has enabled many fancy things, starting with the concept of autoscaling: adding and removing machines from your platform according to defined thresholds and health checks.

The two main problems with immutable deployment are usually dealing with databases and transitioning between two versions. You’ll usually use some variant of blue / green deployment, but they all have their drawbacks, the biggest one being DNS TTL.

Ansible 1.8 brings a very nice option called replace_all_instances, which is a game changer when it comes to smooth deployment in an immutable environment.

As the module documentation puts it: “In a rolling fashion, replace all instances with an old launch configuration with one from the current launch configuration.”

To understand how it works, you need to understand how Amazon Autoscaling Groups work.

An Autoscaling Group is in charge of maintaining a defined number of virtual machines according to predefined thresholds and health checks. It runs a Launch Configuration, which describes the AWS instance you want to run: size, region, image to use, IAM role, etc.

Let’s create an example to make it clearer:

My webserver autoscaling group runs 2 to 10 virtual machines.

  • The virtual machines are placed behind a load balancer with the following health check: if an HTTP request on port 80 does not return "hello", the machine is considered dead and replaced.
  • If CPU usage stays above 95% for 3 consecutive 10-minute periods, the group launches a new machine, then waits 10 minutes before spawning another one if needed (see the sketch after this list).
  • When CPU usage drops under 50%, it kills a virtual machine, then waits 15 minutes before killing the next one.
  • The webserver autoscaling group runs the webserver-201411041210 launch configuration.
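
Those scaling rules translate into AWS scaling policies driven by CloudWatch alarms. Here is a minimal sketch of the scale-up side using Ansible’s ec2_scaling_policy and ec2_metric_alarm modules; every name is illustrative, and the way the policy ARN is wired into the alarm is an assumption about the registered result:

- name: scale-up policy for the webserver group
  ec2_scaling_policy:
    name: webserver-scale-up            # hypothetical policy name
    asg_name: webserver
    state: present
    region: us-east-1
    adjustment_type: ChangeInCapacity
    scaling_adjustment: 1               # add one machine at a time
    cooldown: 600                       # wait 10 minutes before scaling again
  register: scale_up_policy

- name: alarm firing after 3 consecutive 10-minute periods above 95% CPU
  ec2_metric_alarm:
    name: webserver-cpu-high            # hypothetical alarm name
    state: present
    region: us-east-1
    metric: CPUUtilization
    namespace: AWS/EC2
    statistic: Average
    comparison: ">="
    threshold: 95.0
    period: 600                         # one 10-minute period
    evaluation_periods: 3               # three periods in a row
    unit: Percent
    dimensions:
      AutoScalingGroupName: webserver
    alarm_actions:
      - "{{ scale_up_policy.arn }}"     # assumes the result exposes the policy ARN

The scale-down side is symmetric: a policy with scaling_adjustment: -1 and a 15-minute cooldown, triggered by an alarm on CPU below 50%.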

Everything about the virtual machine itself is in the webserver-201411041210 Launch Configuration.

  • The virtual machine is an m3.xlarge one.
  • It spawns only in the us-east-1a and us-east-1b availability zones.
  • It’s a spot instance.
  • It runs under the webserver IAM role.
  • It uses the webserver-201411041210 image.

If I want to reuse the same Autoscaling Group instead of creating a new one every time I deploy the application, here’s what I can do:

  • Create a new webserver-201411041510 launch configuration.
  • Attach the launch configuration to the webserver autoscaling group.
  • Kill one or more machines within the autoscaling group.
  • Wait for the new machine to spawn.
  • Kill a few other machines.
  • Goto 10 (and be taken by a raptor).

Or you can let Ansible do it for you while you brag about automation at the coffee machine.

The following role is from the Ansible immutable servers example. The whole repository and its documentation are worth reading if you don’t know which deployment type to choose.

- name: create launch config
  ec2_lc:
    name: "{{ lc_name }}"
    image_id: "{{ image_id }}"
    key_name: "{{ key_name }}"
    region: "{{ region }}"
    security_groups: "{{ lc_security_groups }}"
    instance_type: "{{ instance_type }}"
    assign_public_ip: yes
  tags: launch_config

- name: create autoscale groups
  ec2_asg:
    name: "{{ asg_group_name }}"
    health_check_period: 60
    load_balancers: "{{ load_balancers }}"
    health_check_type: ELB
    availability_zones: "{{ availability_zones | join(',')}}"
    launch_config_name: "{{ lc_name }}"
    min_size: "{{ asg_min_size }}"
    max_size: "{{ asg_max_size }}"
    desired_capacity: "{{ asg_desired_capacity }}"
    region: "{{ region }}"
    replace_all_instances: yes
    replace_batch_size: 2
    vpc_zone_identifier: "{{ asg_subnets | join(',') }}"
  until: asg_result.viable_instances|int >= asg_desired_capacity|int
  delay: 10
  retries: 120
  register: asg_result
  tags: autoscale_group

The above YAML snippet describes the creation of a Launch Configuration and an Autoscaling Group. Once the Launch Configuration is created, it is attached to the Autoscaling Group. Five lines are really important here.

The first one is replace_all_instances: yes. Once the new Launch Configuration is attached to the Autoscaling Group, all the existing instances are replaced with instances built from it.

replace_batch_size: 2 replaces the instances two by two. If you’re running lots of virtual machines, you’ll want to replace them faster, won’t you?

until: asg_result.viable_instances|int >= asg_desired_capacity|int: retry the task until there are at least as many viable instances (instances passing the health check defined in the Autoscaling Group, here the Elastic Load Balancer one) as the desired capacity.

delay: 10 and retries: 120: check every 10 seconds, up to 120 times, so the task gives up after 20 minutes.
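
For completeness, here is a minimal sketch of how such a role might be invoked from a playbook; the role name, AMI id, and every value below are hypothetical placeholders for your own settings:

- hosts: localhost
  connection: local                      # EC2 modules talk to the AWS API, not to remote hosts
  gather_facts: no
  roles:
    - role: asg                          # hypothetical name for the role above
      lc_name: webserver-201411041510    # one new launch config per deployment
      image_id: ami-00000000             # hypothetical AMI baked for this release
      key_name: deploy
      lc_security_groups: [sg-00000000]
      instance_type: m3.xlarge
      asg_group_name: webserver
      load_balancers: [webserver-elb]
      availability_zones: [us-east-1a, us-east-1b]
      asg_min_size: 2
      asg_max_size: 10
      asg_desired_capacity: 2
      asg_subnets: [subnet-00000000, subnet-11111111]
      region: us-east-1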

How does it work? Well, everything is in the code, so let’s have a look at it.

props = get_properties(as_group)

[...]

for k in props['instance_facts'].keys():
    if k in instances:
        if props['instance_facts'][k]['launch_config_name'] != props['launch_config_name']:
            replaceable += 1
if replaceable == 0:
    changed = False
    return(changed, props)

First, Ansible checks whether the Autoscaling Group contains instances that don’t belong to the new Launch Configuration, and counts how many instances it needs to replace.

# set temporary settings and wait for them to be reached
as_group.max_size = max_size + batch_size
as_group.min_size = min_size + batch_size
as_group.desired_capacity = desired_capacity + batch_size
as_group.update()

Then it updates the Autoscaling Group so the desired number of instances is raised by replace_batch_size (default 1). That way, AWS spawns the new instances and your platform won’t crumble.

wait_timeout = time.time() + wait_timeout
while wait_timeout > time.time() and min_size + batch_size > props['viable_instances']:
    time.sleep(10)
    as_groups = connection.get_all_groups(names=[group_name])
    as_group = as_groups[0]
    props = get_properties(as_group)
if wait_timeout <= time.time():
    # waiting took too long
    module.fail_json(msg = "Waited too long for instances to appear. %s" % time.asctime())
instances = props['instances']
if replace_instances:
    instances = replace_instances
for i in get_chunks(instances, batch_size):
    replace_batch(connection, module, i)

Now, the module waits, up to the defined timeout, for the expected number of viable instances, polling the EC2 API every 10 seconds. Once they are up, the instances to replace are processed in chunks of batch_size.

# return settings to normal
as_group = connection.get_all_groups(names=[group_name])[0]
as_group.max_size = max_size 
as_group.min_size = min_size 
as_group.desired_capacity = desired_capacity
as_group.update()

Last but not least, we return the Autoscaling Group settings to normal.

I’m sure you can see how cool this is. Still, it’s not perfect, and you need to take care of a few things for the replacement to be really smooth.

First, ensure your load balancer always routes a given client to the same server, so they won’t bounce between two versions of your application mid-session. It may seem obvious, but it’s sometimes better to state it.
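
On AWS, that means enabling session stickiness on the Elastic Load Balancer. Here is a minimal sketch with the ec2_elb_lb module, assuming a version of Ansible recent enough to expose the stickiness option (it landed after the 1.8 release this post covers), and hypothetical names throughout:

- name: enable session stickiness on the webserver ELB
  ec2_elb_lb:
    name: webserver-elb                 # hypothetical load balancer name
    state: present
    region: us-east-1
    zones:
      - us-east-1a
      - us-east-1b
    listeners:
      - protocol: http
        load_balancer_port: 80
        instance_port: 80
    stickiness:
      type: loadbalancer                # ELB-generated cookie
      enabled: yes
      expiration: 300                   # seconds a client sticks to the same instance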

Second, your database migrations must be backward compatible by one version. I’ll write about it in a later post, but it means your application v0 must run against the database schema of v1. This is mandatory so you can run both versions side by side for a few minutes and roll back if needed.

Here we are, I’m done with this. I hope you liked it and will give Ansible EC2 a try; it’s really worth it despite that horrible YAML syntax and some bloody limitations.
