>[!info]
>This was last updated in 2024.
# Introduction
Kubernetes is great for larger companies, but sometimes you want to keep upkeep and overhead down. This is where ECS (Elastic Container Service) shines. I currently work for a company whose entire engineering team is about 12 people, with me being the sole person on the more "infrastructure" side of things. Part of my goal was to balance cost, going-public requirements, and maintainability.
ECS is AWS's platform for container management. Tasks in ECS are equivalent to pods in Kubernetes.
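For orientation, here's a minimal sketch of what a Fargate task definition looks like (the family, container name, account ID, and image are all hypothetical):

```json
{
  "family": "prod-my-app",
  "requiresCompatibilities": ["FARGATE"],
  "networkMode": "awsvpc",
  "cpu": "1024",
  "memory": "2048",
  "containerDefinitions": [
    {
      "name": "my-app",
      "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/my-app:latest",
      "essential": true,
      "portMappings": [{ "containerPort": 8080, "protocol": "tcp" }]
    }
  ]
}
```

A service then keeps N copies of this task running behind a load balancer, the same way a Deployment keeps pods running in Kubernetes.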
# Deployments
The examples and steps below are for GitHub Actions, but you can apply the same approach to whatever CI system you're using.
1. Pull down the `task-definition.json`
```shell
aws ecs describe-task-definition \
--task-definition "${{ inputs.environment }}-${{ inputs.app_name }}" \
--region ${{ env.REGION }} --query taskDefinition > task-definition.json
```
2. Combine the task definitions with new info
```yaml
- name: Fill in the new image ID in the Amazon ECS task definition
  id: task-def
  uses: aws-actions/amazon-ecs-render-task-definition@v1
  env:
    VERSION: ${{ env.VERSION }}
  with:
    task-definition: task-definition.json
    container-name: ${{ inputs.app_name }}
    image: "${{ secrets.AWS_ACCOUNT_ID }}.dkr.ecr.${{ env.REGION }}.amazonaws.com/${{ inputs.app_name }}:${{ inputs.version || env.VERSION }}"
    environment-variables: |
      DD_VERSION=${{ inputs.version || env.VERSION }}
```
3. Deploy the new task definition
```yaml
- name: Deploy Amazon ECS task definition
  uses: aws-actions/amazon-ecs-deploy-task-definition@v1
  with:
    task-definition: ${{ steps.task-def.outputs.task-definition }}
    service: "${{ inputs.environment }}-${{ inputs.app_name }}"
    cluster: "${{ inputs.environment }}-ecs-cluster"
    wait-for-service-stability: true
```
# Fargate
Fargate is pretty nice, especially for Java apps, because you can make use of a larger percentage of your system resources compared to a standard instance. The general rule of thumb is that no more than 50% of an instance's resources should go to the JVM, so a JVM that needs 4GB of heap requires an instance with at least 8GB of RAM (and usually you want to go even lower if your app isn't well optimized). With Fargate you can more closely align your resources with the app's needs.
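One way to keep the JVM aligned with the task's memory limit, rather than hard-coding `-Xmx`, is a percentage-based heap. A sketch, assuming a container-aware JDK (10+); the 50% value mirrors the rule of thumb above:

```shell
# Let the JVM size its max heap as a percentage of the container's memory
# limit instead of a fixed -Xmx value (JDK 10+ reads cgroup limits)
export JAVA_TOOL_OPTIONS="-XX:MaxRAMPercentage=50.0"
```

Set this in the container definition's `environment` and the same image works across task sizes.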
## Cost
Most of the cost is on the CPU side, so keep that in mind when sizing tasks.
> [!example] Estimates
> CPU Calculation: $(0.04048 \times 730) \times \text{vCPUs}$ (at \$0.04048 per vCPU-hour, 730 hours/month)
> Comes out to about **$30/month** per vCPU.
>
> Memory Calculation: $(0.004445 \times 730) \times \text{GB of RAM}$ (at \$0.004445 per GB-hour)
> Comes out to about **$3/month** per GB of RAM.
You can get further savings with an AWS Savings Plan. Also note that while this may appear more expensive than an EC2 instance on a 1:1 basis, you're saving the cost of managing the security of those EC2 instances' packages and access.
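The estimates above can be sketched as a quick shell helper (`fargate_monthly_cost` is a hypothetical name; the rates are the on-demand prices quoted above):

```shell
# Estimate monthly Fargate on-demand cost: (vCPU rate + memory rate) * 730 hrs
fargate_monthly_cost() {
  vcpus="$1"; mem_gb="$2"
  awk -v c="$vcpus" -v m="$mem_gb" \
    'BEGIN { printf "%.2f\n", (0.04048 * c + 0.004445 * m) * 730 }'
}

fargate_monthly_cost 1 2   # smallest 1 vCPU / 2GB task
```

A 1 vCPU / 2GB task lands around $36/month before any Savings Plan discount.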
## Task and Service CPU/Memory
With Fargate you pair a CPU amount with a compatible memory amount; here's a table of the current options. ^[https://docs.aws.amazon.com/AmazonECS/latest/developerguide/task-cpu-memory-error.html]
| **vCPU** | **Memory (GB)** | **Other Notes** |
| :-----: | :--------: | ------------- |
| 1 | 2-8 | 1GB Increment |
| 2 | 4-16 | 1GB Increment |
| 4 | 8-30 | 1GB Increment |
| 8 | 16-60 | 4GB Increment |
| 16 | 32-120 | 8GB Increment |
> [!note]
> ECS expects the memory value in MiB, so convert:
> $GB \times 2^{10} = \text{MiB}$
> Example: $6 \times 2^{10} = 6144$
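For example, converting the GB values from the table into the MiB figures the task definition expects:

```shell
# ECS task definitions take memory in MiB: GB * 2^10
gb_to_mib() { echo $(( $1 * 1024 )); }

gb_to_mib 6   # a 6GB task
```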
# Observability
In ECS a `task` is one or more containers running together. To get logging and metrics from your main app container, you'll need two "sidecars" (containers that live alongside your app container in the task):
- [log-router](https://github.com/aws/aws-for-fluent-bit)
- [datadog-agent](https://github.com/DataDog/datadog-agent)
You do lose the ability (as of early 2024) to send the logs to CloudWatch by doing this.
## Datadog
While this is specifically about Datadog, I imagine this applies to all other observability tooling plus or minus some small details.
### Logging
Logging can be handled entirely by the sidecar Datadog agent. With the example below, it will automatically pull in all logs from the container. You need to use the `awsfirelens` log driver and the `amazon/aws-for-fluent-bit` image.
### APM
This depends on the application language; I'd recommend looking at Datadog's documentation for your [language](https://docs.datadoghq.com/tracing/trace_collection/automatic_instrumentation/dd_libraries/). Because you set up the datadog-agent as a sidecar, everything should tie together without too much effort.
### Example
Here's an example as a Terraform template using [terraform-aws-modules/ecs/aws//modules/service](https://registry.terraform.io/modules/terraform-aws-modules/ecs/aws/5.9.1) at v5.9.1.
Container Definition
```hcl
container_definitions = {
  (local.app) = {
    cpu                      = var.app_cpu
    memory                   = var.app_memory
    essential                = true
    enable_autoscaling       = true
    autoscaling_min_capacity = 1
    autoscaling_max_capacity = 2
    autoscaling_policies     = {}
    image                    = "${ACCOUNT_ID}.dkr.ecr.${data.aws_region.current.name}.amazonaws.com/${data.aws_ecr_image.app_image.repository_name}:${data.aws_ecr_image.app_image.image_tag}"
    readonly_root_filesystem = false

    log_configuration = {
      logDriver = "awsfirelens"
      options = {
        Name           = "datadog"
        Host           = "http-intake.logs.datadoghq.com"
        apikey         = jsondecode(data.aws_secretsmanager_secret_version.selected.secret_string)["DD_API_KEY"]
        dd_service     = local.app
        dd_source      = local.dd_source
        dd_message_key = "log"
        dd_tags        = "host:${local.env}-ecs,env:${local.env}"
        TLS            = "on"
        provider       = "ecs"
        retry_limit    = "2"
      }
    }

    health_check = {
      retries  = 10
      command  = ["CMD-SHELL", "curl -f http://localhost:${local.app_port}${local.health_check} || exit 1"]
      timeout  = 5
      interval = 10
    }

    port_mappings = [
      {
        name          = local.app
        containerPort = local.app_port
        hostPort      = local.app_port
        protocol      = "tcp" # port mappings only accept "tcp" or "udp"
      }
    ]

    environment = local.env_vars
    secrets     = local.secret_vars

    docker_labels = {
      "com.datadoghq.tags.service" : local.app,
      "com.datadoghq.tags.env" : local.env,
      "com.datadoghq.ad.logs" : "[{\"source\": \"${local.dd_source}\", \"service\": \"${local.app}\"}]"
    }
  },

  datadog-agent = {
    image                    = "public.ecr.aws/datadog/agent:latest"
    cpu                      = 256
    memory                   = 512
    essential                = true
    readonly_root_filesystem = false
    environment = [
      { name = "ECS_FARGATE", value = "true" },
      { name = "DD_APM_ENABLED", value = local.dd_apm },
      { name = "DD_LOGS_ENABLED", value = "true" },
      { name = "DD_LOGS_CONFIG_CONTAINER_COLLECT_ALL", value = "true" },
      { name = "DD_CONTAINER_ENV_AS_TAGS", value = "" }
    ]
    secrets = [
      { name = "DD_API_KEY", valueFrom = data.aws_secretsmanager_secret.selected.arn }
    ]
    port_mappings = [
      {
        protocol      = "tcp"
        containerPort = 8126
      }
    ]
  },

  log-router = {
    image     = "amazon/aws-for-fluent-bit:stable"
    essential = true
    firelens_configuration = {
      type = "fluentbit"
      options = {
        enable-ecs-log-metadata = "true"
        config-file-type        = "file"
        config-file-value       = "/fluent-bit/configs/parse-json.conf"
      }
    }
    memory_reservation = 50
  }
}
```
# Common tasks
## Restart
```shell
aws ecs update-service --cluster test-ecs-cluster --service test-app --region region-1 --force-new-deployment
```
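To check whether that rollout finished, you can inspect the service's in-flight deployments (same hypothetical cluster/service/region names as above):

```shell
# Show deployment status and task counts for the service
aws ecs describe-services \
  --cluster test-ecs-cluster \
  --services test-app \
  --region region-1 \
  --query 'services[0].deployments[*].{status:status,running:runningCount,desired:desiredCount}'
```

A single entry with `status: PRIMARY` and `running` equal to `desired` means the new deployment is fully rolled out.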
---
# Resources
- https://aws.amazon.com/blogs/containers/graceful-shutdowns-with-ecs/
- https://docs.github.com/en/actions/deployment/deploying-to-your-cloud-provider/deploying-to-amazon-elastic-container-service
- https://docs.aws.amazon.com/AmazonECS/latest/developerguide/Welcome.html