Kubernetes on AWS
WORK IN PROGRESS
This repo contains configuration templates to provision Kubernetes
clusters on AWS using Cloud Formation and Ubuntu Linux.
Many values are parameterized, and their actual values are not always
visible in this repository. We are focusing on solving our own,
Zalando-specific use case. However, we are open to ideas from the
community about turning this into a project that provides general
value to others. Please contact us via our Issues Tracker with your
thoughts and suggestions.
The configuration in this repository was initially based on kube-aws,
but now depends on four components, not all of which are open source
yet:
- Cluster Registry to keep the desired cluster states (e.g. the config channel and version in use)
- Cluster Lifecycle Manager to provision the cluster's Cloud Formation stack and apply Kubernetes manifests for system components
- Cluster Lifecycle Controller to handle rolling updates from inside the cluster, for example node termination
- Authnz Webhook to validate OAuth tokens and authorize access
Learn more about Zalando's cloud native journey by reading the Zalando
Case Study on kubernetes.io. See our Running Kubernetes in Production
on AWS document for details on the setup.
Features
- Highly available master nodes (ASG) behind ELB
- Worker Auto Scaling Group with node pools support
- Flannel overlay networking
- Cluster autoscaling (using cluster-autoscaler)
- Kubernetes DNS with node-local dnsmasq as a daemonset and a CoreDNS resolver for the cluster.local domain running in the same pod
- Route53 DNS integration via External DNS
- AWS IAM integration via kube2iam, AWS OIDC IAM
- Standard components are installed: node exporter, kube-state-metrics; see also the cluster/manifests directory
- Webhook authentication and authorization (roles "ReadOnly", "PowerUser", "Manual", "Emergency", "Administrator")
- Emergency access via the internal emergency-access-service, which grants the roles "Manual" and "Emergency" under the four-eyes principle with audit logging
- Log shipping via Scalyr
- Full Ingress support with ALB/NLB and TLS integration via kube-ingress-aws-controller, and HTTP routing via skipper
- Enhanced usability with managed stacks and blue-green deployments via stackset-controller and skipper
- Fabric API Gateway, which can be used in combination with stackset-controller
- Static egress IPs to route through NAT Gateways with Elastic IPs via kube-static-egress-controller
- Horizontal Pod Autoscaling with scaling by requests per second, SQS queue size, or other metrics via kube-metrics-adapter
- Vertical Pod Autoscaling to scale, for example, Prometheus
- EFS support
- GPU support
- etcd backup via a Kubernetes cronjob that takes an etcdctl snapshot and uploads it to S3
- Monitoring via Prometheus and OpenTracing
- Fully automated cluster updates via Cluster Lifecycle Manager
- Automated downscaling for test clusters with kube-downscaler
- Fallback node pools
- Spot node pool integration
- Automated PDB creation with pdb-controller
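
The scaling decision behind Horizontal Pod Autoscaling can be sketched as follows. This is a minimal illustration of the replica formula documented for the upstream Kubernetes HPA, not code from this repository or from kube-metrics-adapter. The function name and the example numbers are assumptions; the default tolerance of 0.1 matches the upstream default.

```python
import math


def desired_replicas(current_replicas: int,
                     current_value: float,
                     target_value: float,
                     tolerance: float = 0.1) -> int:
    """Return the replica count the upstream HPA formula would ask for.

    current_value is the observed per-pod average of the metric
    (for example requests per second), target_value the desired average.
    """
    if target_value <= 0:
        raise ValueError("target_value must be positive")
    ratio = current_value / target_value
    # Within the tolerance band the HPA leaves the replica count unchanged.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    return math.ceil(current_replicas * ratio)


# Hypothetical example: 4 pods averaging 150 rps each against a 100 rps target.
print(desired_replicas(4, 150.0, 100.0))
```

The same formula applies regardless of where the metric comes from (requests per second, SQS queue size, and so on); only the metric source differs.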
Notes
- Node and user authentication is done via tokens (using the webhook feature)
- SSL client-cert authentication is disabled
- Many values are hardcoded
- Secrets (e.g. shared token) are not KMS-encrypted in the cluster
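
To make the token-based webhook authentication more concrete, here is a minimal sketch of the TokenReview exchange that the Kubernetes API server performs with a webhook authenticator. It is not the Authnz Webhook implementation, which is not shown in this repository. The `lookup` callback, the user name, and the use of a role name as a group are assumptions made for illustration only.

```python
import json
from typing import Callable, Optional


def token_review_response(request_body: str,
                          lookup: Callable[[str], Optional[dict]]) -> dict:
    """Build a TokenReview response for a TokenReview request.

    lookup is a hypothetical callback that validates the bearer token
    (for example against an OAuth token info endpoint) and returns a dict
    with "username", "uid" and "groups", or None if the token is invalid.
    """
    review = json.loads(request_body)
    token = review.get("spec", {}).get("token", "")
    identity = lookup(token) if token else None

    if identity is None:
        status = {"authenticated": False}
    else:
        status = {
            "authenticated": True,
            "user": {
                "username": identity["username"],
                "uid": identity["uid"],
                "groups": identity["groups"],
            },
        }
    return {
        # Echo the API version the API server used in its request.
        "apiVersion": review.get("apiVersion", "authentication.k8s.io/v1"),
        "kind": "TokenReview",
        "status": status,
    }


def fake_lookup(token: str) -> Optional[dict]:
    # Assumption for the example: one known token mapped to one group.
    if token == "valid-token":
        return {"username": "jdoe", "uid": "jdoe", "groups": ["ReadOnly"]}
    return None


request = json.dumps({
    "apiVersion": "authentication.k8s.io/v1",
    "kind": "TokenReview",
    "spec": {"token": "valid-token"},
})
print(json.dumps(token_review_response(request, fake_lookup), indent=2))
```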
Assumptions
- The AWS account has one or more hosted zones in Route53, including a proper SSL cert (you can use the free ACM service)
- The VPC has at least one public subnet per AZ (either the AWS default VPC setup or a public subnet named "dmz-<REGION>-<AZ>")
- The VPC is in region eu-central-1 or eu-west-1
- An etcd cluster is available via DNS discovery (SRV records) at etcd.<YOUR-HOSTED-ZONE>
- OAuth Token Info is available to validate user tokens
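
Two of these assumptions can be checked mechanically before provisioning. The sketch below is illustrative only: the function names are hypothetical, and it assumes that etcd.<YOUR-HOSTED-ZONE> is used as the etcd discovery domain, in which case etcd's DNS discovery looks up the standard `_etcd-server` and `_etcd-client` SRV records (with an `-ssl` suffix for TLS) under that domain.

```python
SUPPORTED_REGIONS = {"eu-central-1", "eu-west-1"}


def region_supported(region: str) -> bool:
    """Check the region against the two regions listed in the assumptions."""
    return region in SUPPORTED_REGIONS


def etcd_srv_names(hosted_zone: str, tls: bool = True) -> dict:
    """Return the SRV record names etcd DNS discovery would query.

    Assumes the discovery domain is "etcd.<hosted_zone>".
    """
    domain = "etcd." + hosted_zone.strip(".")
    suffix = "-ssl" if tls else ""
    return {
        "server": f"_etcd-server{suffix}._tcp.{domain}",
        "client": f"_etcd-client{suffix}._tcp.{domain}",
    }


# Hypothetical hosted zone, for illustration only.
for kind, name in etcd_srv_names("example.org").items():
    print(kind, name)
print(region_supported("eu-central-1"))
```

The resulting names can be resolved with any DNS tool that supports SRV queries to confirm that the records exist.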
Directory Structure
- cluster/cluster.yaml: Cloud Formation template file for the cluster (will be applied by Cluster Lifecycle Manager)
- cluster/config-defaults.yaml: Default values for different kinds of use, which can be overridden by values from our cluster-registry (will be applied by Cluster Lifecycle Manager)
- cluster/etcd-cluster.yaml: Senza Cloud Formation definition to deploy etcd
- cluster/manifests: Kubernetes manifests for system components (will be applied by Cluster Lifecycle Manager)
- cluster/node-pools: Cloud Formation template files and userdata (cloud-init) for ContainerLinux node-pools (will be applied by Cluster Lifecycle Manager)
- docs: extracts from internal Zalando documentation.
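
The relationship between cluster/config-defaults.yaml and the cluster-registry values can be pictured as a simple override: a registry value wins whenever it is set, otherwise the default applies. The sketch below only illustrates that idea. The actual merge logic lives in Cluster Lifecycle Manager and is not described here, and the config keys shown are hypothetical.

```python
def effective_config(defaults: dict, registry_overrides: dict) -> dict:
    """Return defaults with per-cluster registry values layered on top."""
    merged = dict(defaults)            # copy, so the defaults stay untouched
    merged.update(registry_overrides)  # registry values take precedence
    return merged


# Hypothetical keys and values, for illustration only.
defaults = {"example_feature_enabled": "false", "example_replicas": "2"}
overrides = {"example_replicas": "3"}

print(effective_config(defaults, overrides))
```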