Saving Money with Spot ECS Clusters on AWS


Introduction

This post will show you how to save money on AWS infrastructure by running your containerized applications on a hybrid Amazon Elastic Container Service (ECS) cluster composed of both cheaper spot instances and more expensive on-demand instances.

Amazon Elastic Container Service (ECS) is a Docker container deployment and orchestration system. It supports load balancing, auto scaling, and a variety of other features that make it easy for engineers to scale containerized applications across hundreds of instances.

Amazon spot instances are reduced-rate instances that run on Amazon’s spare capacity. They are typically discounted 70-90% from the standard on-demand price; at an 80% discount, for example, an instance that costs $0.10 per hour on demand costs only about $0.02 per hour on the spot market. In return for providing such a large discount, Amazon reserves the right to reclaim those instances with two minutes’ notice.

Traditionally this has made spot instances a good choice for offline batch processing workloads, which have no interactive requirements and are not harmed by the occasional service interruption.

Still, despite the risk of service interruptions, the dramatic savings offered by spot instances have driven many engineers to find ways to run interactive services, like web servers, on them. Indeed, a well-designed service can maintain high uptime guarantees when deployed onto a hybrid cluster of spot and on-demand instances.

Even better, a recent update by AWS to the spot pricing model has greatly stabilized spot prices and reduced the likelihood of spot service interruptions, making spot instances an even more attractive option for all workloads.

The Stack

We will create two auto scaling groups: one to manage the on-demand instances, and another to manage the spot instances.

Instances in each group will be tagged with a purchase-option ECS attribute, which will later allow us to use task placement strategies and constraints to ensure that interactive services run some or all of their tasks on the on-demand instances; we’ll cover this in detail in a later blog post.
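
As a preview, here is a minimal sketch of what that can look like: an ECS service pinned to on-demand instances with a memberOf placement constraint. The service name, task definition, and desired count are hypothetical, and we assume the on-demand launch config tags its instances with a purchase-option value of ondemand (the spot config shown later uses spot).

# Hypothetical service: every task must land on an on-demand instance.
resource "aws_ecs_service" "web" {
  name            = "web"
  cluster         = "default"
  task_definition = "${aws_ecs_task_definition.web.arn}" # assumed to exist
  desired_count   = 2

  # memberOf constraints use the ECS cluster query language to match
  # the custom purchase-option attribute set in the instance user data.
  placement_constraints {
    type       = "memberOf"
    expression = "attribute:purchase-option == ondemand"
  }
}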

Basic Terraform Intro

We will be using Terraform, a tool that lets you describe and deploy infrastructure as code, to set up the cluster infrastructure. If you’re not familiar with Terraform you can read more about it here: https://www.terraform.io/intro/index.html.

If you want to skip the explanations and view the full Terraform script, it can be found here. Below we’ll get into the nitty-gritty setup of each individual resource.

Setting Up Terraform Variables and Boilerplate

First we will set the AWS region and subnets into which we will deploy our cluster.

provider "aws" {
  region     = "us-west-2"
}

# Look up the default VPC.
data "aws_vpc" "default" {
  default = true
}

# Read all subnet ids for this vpc/region.
data "aws_subnet_ids" "all_subnets" {
  vpc_id = "${data.aws_vpc.default.id}"
}

Here we configure the number and type of instances in the cluster.

# Define some variables we'll use later.
locals {
  instance_type = "m4.large"
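  # Maximum hourly price (USD) we are willing to pay for each spot instance.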
  spot_price = "0.10"
  key_name = "solid"
  ecs_cluster_name = "default"
  max_spot_instances = 10
  min_spot_instances = 3

  max_ondemand_instances = 3
  min_ondemand_instances = 3
}

Next we will tell Terraform how to find the most recent ECS AMI.

# Lookup the current ECS AMI.
# In a production environment you probably want to 
# hardcode the AMI ID, to prevent upgrading to a 
# new and potentially broken release.
data "aws_ami" "ecs" {
  most_recent = true

  filter {
    name   = "name"
    values = ["amzn-ami-*-amazon-ecs-optimized"]
  }

  filter {
    name   = "virtualization-type"
    values = ["hvm"]
  }

  owners = ["591542846629"] # Amazon
}
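
If you do choose to hardcode the AMI, one approach (a sketch; the AMI ID below is a placeholder, not a real release) is to keep the pinned ID in a local and reference it instead of the data source:

# Pin a known-good ECS-optimized AMI for your region.
# The ID below is a placeholder; substitute a real one.
locals {
  pinned_ecs_ami_id = "ami-0123456789abcdef0"
}

You would then reference "${local.pinned_ecs_ami_id}" rather than "${data.aws_ami.ecs.id}" in the launch configurations below.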

Creating Security Group and IAM Role Plumbing

Now we will create a separate security group for the cluster, which will allow us to restrict access to the cluster’s instances. In this case we’ll enable inbound SSH access from anywhere and outbound connections to anywhere.

# Create a Security Group with SSH access from the world
resource "aws_security_group" "ecs_cluster" {
  name        = "${local.ecs_cluster_name}_ecs_cluster"
  description = "An ecs cluster"
  vpc_id      = "${data.aws_vpc.default.id}"

  ingress {
    from_port   = 22
    to_port     = 22
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }

  egress {
    from_port       = 0
    to_port         = 0
    protocol        = "-1"
    cidr_blocks     = ["0.0.0.0/0"]
    prefix_list_ids = []
  }
}

Now we’ll create a separate IAM role for the cluster’s instances.

# Create an IAM role for the ECS instances.
resource "aws_iam_role" "ecs_instance" {
  name  = "ecs_instance"

  assume_role_policy = <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Action": "sts:AssumeRole",
      "Principal": {
        "Service": "ec2.amazonaws.com"
      },
      "Effect": "Allow",
      "Sid": ""
    }
  ]
}
EOF
}

Then we’ll grant that IAM role only the permissions necessary for ECS to function.

# Create and attach an IAM role policy which allows the necessary
# permissions for the ECS agent to function.
data "aws_iam_policy_document" "ecs_instance_role_policy_doc" {
  statement {
    actions = [
      "ecs:CreateCluster",
      "ecs:DeregisterContainerInstance",
      "ecs:DiscoverPollEndpoint",
      "ecs:Poll",
      "ecs:RegisterContainerInstance",
      "ecs:StartTelemetrySession",
      "ecs:Submit*",
      "ecr:GetAuthorizationToken",
      "ecr:BatchCheckLayerAvailability",
      "ecr:GetDownloadUrlForLayer",
      "ecr:BatchGetImage",
      "logs:CreateLogStream",
      "logs:PutLogEvents" 
    ]
    resources = [
      "*",
    ]
  }
}

We’ve left out some of the boilerplate Terraform config for attaching the IAM policy above to the IAM role; it can be found in the full Terraform file here.
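
For reference, that plumbing looks roughly like this (a sketch; the resource names are assumptions chosen to match the references used elsewhere in this post):

# Attach the policy document above to the instance role.
resource "aws_iam_role_policy" "ecs_instance_role_policy" {
  name   = "ecs_instance_role_policy"
  role   = "${aws_iam_role.ecs_instance.id}"
  policy = "${data.aws_iam_policy_document.ecs_instance_role_policy_doc.json}"
}

# Wrap the role in an instance profile so the launch configs can attach it.
resource "aws_iam_instance_profile" "ecs_iam_profile" {
  name = "ecs_iam_profile"
  role = "${aws_iam_role.ecs_instance.name}"
}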

Creating Launch Configurations for Autoscaling Groups

Auto scaling groups use AWS launch configurations to spin up new instances, so we’ll need to create some of those first. This is the launch configuration definition for the spot instances; the on-demand config is similar, and a sketch of it appears after the notes below.

# Create two launch configs: one for on-demand instances and the other
# for spot. Only the spot config is shown here.
resource "aws_launch_configuration" "ecs_config_launch_config_spot" {
  name_prefix   = "${local.ecs_cluster_name}_ecs_cluster_spot"
  image_id      = "${data.aws_ami.ecs.id}"
  instance_type = "${local.instance_type}"
  spot_price    = "${local.spot_price}"
  enable_monitoring = true
  lifecycle {
    create_before_destroy = true
  }
  user_data = <<EOF
#!/bin/bash
echo ECS_CLUSTER=${local.ecs_cluster_name} >> /etc/ecs/ecs.config
echo ECS_INSTANCE_ATTRIBUTES={\"purchase-option\":\"spot\"} >> /etc/ecs/ecs.config
EOF
  security_groups = ["${aws_security_group.ecs_cluster.id}"]
  key_name = "${local.key_name}"
  iam_instance_profile = "${aws_iam_instance_profile.ecs_iam_profile.arn}"
}

Note that we leverage the instance user data to specify which ECS cluster instances should join. We also add a custom purchase-option ECS attribute, which can later be used in task placement strategies and constraints to guarantee that at least some containers for interactive services are placed on on-demand instances.

You can also configure other cluster-level services here, such as logging or infrastructure monitoring software (e.g. New Relic or Datadog).
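
For completeness, here is a sketch of the matching on-demand launch config. The resource name and the ondemand attribute value are assumptions (the full template is the source of truth); it is identical to the spot config except that it omits spot_price and tags instances with a different purchase-option:

resource "aws_launch_configuration" "ecs_config_launch_config_ondemand" {
  name_prefix   = "${local.ecs_cluster_name}_ecs_cluster_ondemand"
  image_id      = "${data.aws_ami.ecs.id}"
  instance_type = "${local.instance_type}"
  enable_monitoring = true
  lifecycle {
    create_before_destroy = true
  }
  user_data = <<EOF
#!/bin/bash
echo ECS_CLUSTER=${local.ecs_cluster_name} >> /etc/ecs/ecs.config
echo ECS_INSTANCE_ATTRIBUTES={\"purchase-option\":\"ondemand\"} >> /etc/ecs/ecs.config
EOF
  security_groups = ["${aws_security_group.ecs_cluster.id}"]
  key_name = "${local.key_name}"
  iam_instance_profile = "${aws_iam_instance_profile.ecs_iam_profile.arn}"
}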

Creating the Autoscaling Groups

Finally we get to create the auto scaling groups! Here we have the definition for the spot auto scaling group; the on-demand group is analogous, as sketched after it.

resource "aws_autoscaling_group" "ecs_cluster_spot" {
  name_prefix               = "${aws_launch_configuration.ecs_config_launch_config_spot.name}_ecs_cluster_spot"
  termination_policies = ["OldestInstance"]
  max_size                  = "${local.max_spot_instances}"
  min_size                  = "${local.min_spot_instances}"
  launch_configuration      = "${aws_launch_configuration.ecs_config_launch_config_spot.name}"
  lifecycle {
    create_before_destroy = true
  }
  vpc_zone_identifier       = ["${data.aws_subnet_ids.all_subnets.ids}"]
}
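
The on-demand auto scaling group is nearly identical (again a sketch; the resource name is an assumption), swapping in the on-demand launch config and instance counts:

resource "aws_autoscaling_group" "ecs_cluster_ondemand" {
  name_prefix          = "${aws_launch_configuration.ecs_config_launch_config_ondemand.name}_ecs_cluster_ondemand"
  termination_policies = ["OldestInstance"]
  max_size             = "${local.max_ondemand_instances}"
  min_size             = "${local.min_ondemand_instances}"
  launch_configuration = "${aws_launch_configuration.ecs_config_launch_config_ondemand.name}"
  lifecycle {
    create_before_destroy = true
  }
  vpc_zone_identifier = ["${data.aws_subnet_ids.all_subnets.ids}"]
}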

Setting Up Auto Scaling (optional)

If your cluster will experience variable workloads, it makes sense to set an auto scaling policy to manage utilization.

Here we have an auto scaling policy for the spot instances which aims to keep the ECS cluster at around 70% memory reservation. You may need to adjust this number depending on the types of services you are deploying, but generally we’ve found that targeting 70% reserved memory provides enough margin for deployments and auto scaling of individual ECS services on clusters with more than ~10 instances. If you have a cluster of fewer than 10 instances, or you schedule lots of tasks which reserve more than 50% of an instance’s available memory, you may need to maintain more free memory on your cluster, perhaps targeting 50-60% reservation to provide enough margin for deployments.

# Attach an autoscaling policy to the spot cluster to target 70% MemoryReservation on the ECS cluster.
resource "aws_autoscaling_policy" "ecs_cluster_scale_policy" {
  name                   = "${local.ecs_cluster_name}_ecs_cluster_spot_scale_policy"
  policy_type            = "TargetTrackingScaling"
  adjustment_type        = "ChangeInCapacity"
  autoscaling_group_name = "${aws_autoscaling_group.ecs_cluster_spot.name}"

  target_tracking_configuration {
    customized_metric_specification {
      metric_dimension {
        name  = "ClusterName"
        value = "${local.ecs_cluster_name}"
      }
      metric_name = "MemoryReservation"
      namespace   = "AWS/ECS"
      statistic   = "Average"
    }
    target_value = 70.0
  }
}

Deploying with Terraform

Assuming you’ve installed Terraform and have checked out and edited the template here, you can deploy a cluster by running:

    terraform init
    terraform apply

Once Terraform is done, your new cluster should be ready for you to deploy tasks and services onto it!
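
If you’d like to confirm that the instances have registered with the cluster, one way to check (assuming you have the AWS CLI configured) is:

    aws ecs list-container-instances --cluster default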

Have Questions? Comments? Shoot us an email at contact@solidsoftware.io
