>[!info]
>This was last updated in 2024.
# Introduction
Kubernetes is great for larger companies, but sometimes you want to keep upkeep and overhead down. This is where ECS (Elastic Container Service) shines. I currently work for a company whose entire engineering team is about 12 people, with me being the sole person on the more "infrastructure" side of things. Part of my goal was to balance cost, going-public requirements, and maintainability.
ECS is AWS's platform for container management. Tasks in ECS are equivalent to pods in Kubernetes.
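For orientation, here's a minimal sketch of what a Fargate task definition looks like (the family, container name, account ID, and image are all hypothetical):

```json
{
  "family": "prod-my-app",
  "requiresCompatibilities": ["FARGATE"],
  "networkMode": "awsvpc",
  "cpu": "1024",
  "memory": "2048",
  "containerDefinitions": [
    {
      "name": "my-app",
      "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/my-app:latest",
      "essential": true,
      "portMappings": [{ "containerPort": 8080, "protocol": "tcp" }]
    }
  ]
}
```

A service then keeps N copies of this task running behind a load balancer, the same way a Deployment keeps pods running in Kubernetes.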
# Deployments
The examples and steps below are for GitHub Actions, but you can apply the same approach to whatever CI system you're using.
1. Pull down the `task-definition.json`
```shell
aws ecs describe-task-definition \
--task-definition "${{ inputs.environment }}-${{ inputs.app_name }}" \
--region ${{ env.REGION }} --query taskDefinition > task-definition.json
```
2. Combine the task definitions with new info
```yaml
- name: Fill in the new image ID in the Amazon ECS task definition
  id: task-def
  uses: aws-actions/amazon-ecs-render-task-definition@v1
  env:
    VERSION: ${{ env.VERSION }}
  with:
    task-definition: task-definition.json
    container-name: ${{ inputs.app_name }}
    image: "${{ secrets.AWS_ACCOUNT_ID }}.dkr.ecr.${{ env.REGION }}.amazonaws.com/${{ inputs.app_name }}:${{ inputs.version || env.VERSION }}"
    environment-variables: |
      DD_VERSION=${{ inputs.version || env.VERSION }}
```
3. Deploy the new task definition
```yaml
- name: Deploy Amazon ECS task definition
  uses: aws-actions/amazon-ecs-deploy-task-definition@v1
  with:
    task-definition: ${{ steps.task-def.outputs.task-definition }}
    service: "${{ inputs.environment }}-${{ inputs.app_name }}"
    cluster: "${{ inputs.environment }}-ecs-cluster"
    wait-for-service-stability: true
```
# Fargate
Fargate is pretty nice, especially for Java apps, because you can make use of a larger percentage of your system resources compared to a standard instance. The general rule of thumb is that no more than 50% of an instance's resources should go to the JVM, so a JVM that needs 4GB of heap requires an instance with at least 8GB of RAM (and usually you want to go even lower if your app isn't well optimized). With Fargate you can more closely align your resources with the app's needs.
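One way to keep the JVM aligned with the task's memory limit, rather than hard-coding `-Xmx`, is a percentage-based heap. A sketch, assuming a container-aware JDK (10+); the 50% value mirrors the rule of thumb above:

```shell
# Let the JVM size its max heap as a percentage of the container's memory
# limit instead of a fixed -Xmx value (JDK 10+ reads cgroup limits)
export JAVA_TOOL_OPTIONS="-XX:MaxRAMPercentage=50.0"
```

Set this in the container definition's `environment` and the same image works across task sizes.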
## Cost
Most of the cost is on the CPU side, so keep that in mind when sizing tasks.
> [!example] Estimates
> CPU Calculation: $(0.04048 \times 730) \times \text{vCPUs}$ (at \$0.04048 per vCPU-hour, 730 hours/month)
> Comes out to about **$30/month** per vCPU.
>
> Memory Calculation: $(0.004445 \times 730) \times \text{GB of RAM}$ (at \$0.004445 per GB-hour)
> Comes out to about **$3/month** per GB of RAM.
You can get further savings with an AWS Savings Plan. Also note that while this may appear more expensive than an EC2 instance on a 1:1 basis, you're saving the cost of managing the security of those EC2 instances' packages and access.
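The estimates above can be sketched as a quick shell helper (`fargate_monthly_cost` is a hypothetical name; the rates are the on-demand prices quoted above):

```shell
# Estimate monthly Fargate on-demand cost: (vCPU rate + memory rate) * 730 hrs
fargate_monthly_cost() {
  vcpus="$1"; mem_gb="$2"
  awk -v c="$vcpus" -v m="$mem_gb" \
    'BEGIN { printf "%.2f\n", (0.04048 * c + 0.004445 * m) * 730 }'
}

fargate_monthly_cost 1 2   # smallest 1 vCPU / 2GB task
```

A 1 vCPU / 2GB task lands around $36/month before any Savings Plan discount.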
## Task and Service CPU/Memory
With Fargate you pair a CPU amount with a compatible memory amount; here's a table of the current options. ^[https://docs.aws.amazon.com/AmazonECS/latest/developerguide/task-cpu-memory-error.html]
| **vCPU** | **Memory (GB)** | **Other Notes** |
| :-----: | :--------: | ------------- |
| 1 | 2-8 | 1GB Increment |
| 2 | 4-16 | 1GB Increment |
| 4 | 8-30 | 1GB Increment |
| 8 | 16-60 | 4GB Increment |
| 16 | 32-120 | 8GB Increment |
> [!note]
> ECS expects the memory value in MiB, so convert:
> $GB \times 2^{10} = \text{MiB}$
> Example: $6 \times 2^{10} = 6144$
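For example, converting the GB values from the table into the MiB figures the task definition expects:

```shell
# ECS task definitions take memory in MiB: GB * 2^10
gb_to_mib() { echo $(( $1 * 1024 )); }

gb_to_mib 6   # a 6GB task
```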
# Observability
In ECS a `task` is one or more containers running together. To get logging and metrics from your main app container, you'll need two "sidecars" (containers that live alongside your app container in the task):
- [log-router](https://github.com/aws/aws-for-fluent-bit)
- [datadog-agent](https://github.com/DataDog/datadog-agent)
You do lose the ability (as of early 2024) to send the logs to CloudWatch by doing this.
## Datadog
While this is specifically about Datadog, I imagine this applies to all other observability tooling plus or minus some small details.
### Logging
Logging can be handled entirely by the sidecar Datadog agent. With the example below, it will automatically pull in all logs from the container. You need to use the `awsfirelens` log driver and the `amazon/aws-for-fluent-bit` image.
### APM
This depends on the application language; I'd recommend looking at Datadog's documentation for your [language](https://docs.datadoghq.com/tracing/trace_collection/automatic_instrumentation/dd_libraries/). Because you set up the datadog-agent as a sidecar, everything should tie together without too much effort.
### Example
Here's an example as a Terraform template using [terraform-aws-modules/ecs/aws//modules/service](https://registry.terraform.io/modules/terraform-aws-modules/ecs/aws/5.9.1) at v5.9.1.
Container Definition
```hcl
container_definitions = {
  (local.app) = {
    cpu                      = var.app_cpu
    memory                   = var.app_memory
    essential                = true
    enable_autoscaling       = true
    autoscaling_min_capacity = 1
    autoscaling_max_capacity = 2
    autoscaling_policies     = {}
    image                    = "${ACCOUNT_ID}.dkr.ecr.${data.aws_region.current.name}.amazonaws.com/${data.aws_ecr_image.app_image.repository_name}:${data.aws_ecr_image.app_image.image_tag}"
    readonly_root_filesystem = false

    log_configuration = {
      logDriver = "awsfirelens"
      options = {
        Name           = "datadog"
        Host           = "http-intake.logs.datadoghq.com"
        apikey         = jsondecode(data.aws_secretsmanager_secret_version.selected.secret_string)["DD_API_KEY"]
        dd_service     = local.app
        dd_source      = local.dd_source
        dd_message_key = "log"
        dd_tags        = "host:${local.env}-ecs,env:${local.env}"
        TLS            = "on"
        provider       = "ecs"
        retry_limit    = "2"
      }
    }

    health_check = {
      retries  = 10
      command  = ["CMD-SHELL", "curl -f http://localhost:${local.app_port}${local.health_check} || exit 1"]
      timeout  = 5
      interval = 10
    }

    port_mappings = [
      {
        name          = local.app
        containerPort = local.app_port
        hostPort      = local.app_port
        protocol      = "tcp" # port mappings only accept "tcp" or "udp"
      }
    ]

    environment = local.env_vars
    secrets     = local.secret_vars

    docker_labels = {
      "com.datadoghq.tags.service" : local.app,
      "com.datadoghq.tags.env" : local.env,
      "com.datadoghq.ad.logs" : "[{\"source\": \"${local.dd_source}\", \"service\": \"${local.app}\"}]"
    }
  },

  datadog-agent = {
    image                    = "public.ecr.aws/datadog/agent:latest"
    cpu                      = 256
    memory                   = 512
    essential                = true
    readonly_root_filesystem = false
    environment = [
      { name = "ECS_FARGATE", value = "true" },
      { name = "DD_APM_ENABLED", value = local.dd_apm },
      { name = "DD_LOGS_ENABLED", value = "true" },
      { name = "DD_LOGS_CONFIG_CONTAINER_COLLECT_ALL", value = "true" },
      { name = "DD_CONTAINER_ENV_AS_TAGS", value = "" }
    ]
    secrets = [
      { name = "DD_API_KEY", valueFrom = data.aws_secretsmanager_secret.selected.arn }
    ]
    port_mappings = [
      {
        protocol      = "tcp"
        containerPort = 8126
      }
    ]
  },

  log-router = {
    image     = "amazon/aws-for-fluent-bit:stable"
    essential = true
    firelens_configuration = {
      type = "fluentbit"
      options = {
        enable-ecs-log-metadata = "true"
        config-file-type        = "file"
        config-file-value       = "/fluent-bit/configs/parse-json.conf"
      }
    }
    memory_reservation = 50
  }
}
```
# Common tasks
## Restart
```shell
aws ecs update-service --cluster test-ecs-cluster --service test-app --region region-1 --force-new-deployment
```
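To check whether that rollout finished, you can inspect the service's in-flight deployments (same hypothetical cluster/service/region names as above):

```shell
# Show deployment status and task counts for the service
aws ecs describe-services \
  --cluster test-ecs-cluster \
  --services test-app \
  --region region-1 \
  --query 'services[0].deployments[*].{status:status,running:runningCount,desired:desiredCount}'
```

A single entry with `status: PRIMARY` and `running` equal to `desired` means the new deployment is fully rolled out.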
---
# Resources
- https://aws.amazon.com/blogs/containers/graceful-shutdowns-with-ecs/
- https://docs.github.com/en/actions/deployment/deploying-to-your-cloud-provider/deploying-to-amazon-elastic-container-service
- https://docs.aws.amazon.com/AmazonECS/latest/developerguide/Welcome.html