How do you run Spark jobs with Amazon EMR on Amazon EKS?
Introduction
Amazon EMR on EKS provides a deployment option for Amazon EMR that allows you to run open-source big data frameworks on Amazon Elastic Kubernetes Service (Amazon EKS).
With this deployment option, you can focus on running analytics workloads while Amazon EMR on EKS builds, configures, and manages containers for open-source applications.
The following diagram shows the two different deployment models for Amazon EMR.
Amazon EMR on EKS loosely couples applications to the infrastructure that they run on. With this loose coupling of services, you can run multiple, securely isolated jobs simultaneously. You can also benchmark the same job with different compute backends or spread your job across multiple Availability Zones to improve availability.
The following diagram illustrates how Amazon EMR on EKS works with other AWS services.
Scaling with Karpenter
Karpenter is a dynamic, high-performance, open-source cluster autoscaler for Kubernetes. Karpenter works by:
- Watching for pods marked as unschedulable
- Evaluating scheduling constraints (resource requests, node selectors, affinities, tolerations, and topology spread constraints) requested by the pods
- Provisioning nodes that meet the requirements of the pods
- Removing the nodes when the nodes are no longer needed
In short, Karpenter’s job is to add nodes to handle unschedulable pods, schedule pods on those nodes, and remove the nodes when they are no longer needed.
To configure Karpenter, you create provisioners that define how Karpenter manages unschedulable pods and expires nodes.
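As a concrete illustration, a provisioner can be applied with `kubectl` as shown below. This is a minimal sketch using the `karpenter.sh/v1alpha5` Provisioner API; the instance types, CPU limit, and empty-node TTL are illustrative assumptions, not the values used by the demo's setup script.

```shell
# Illustrative only: a minimal Karpenter Provisioner that lets Karpenter
# launch on-demand m5 nodes and reclaim them 30 seconds after they are empty.
# These values are assumptions, not the settings used by karpenter-setup.sh.
kubectl apply -f - <<'EOF'
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: default
spec:
  requirements:
    - key: karpenter.sh/capacity-type
      operator: In
      values: ["on-demand"]
    - key: node.kubernetes.io/instance-type
      operator: In
      values: ["m5.xlarge", "m5.2xlarge"]
  limits:
    resources:
      cpu: "100"
  ttlSecondsAfterEmpty: 30
  providerRef:
    name: default
EOF
```

The `ttlSecondsAfterEmpty` setting is what makes Karpenter remove nodes once the pods they hosted have finished.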
The following diagram illustrates how it works:
Demo
Setting up Amazon EMR on EKS
Open AWS CloudShell from the AWS Management Console.
Install eksctl
Download and extract the latest release of eksctl with the following command.
curl --silent --location "https://github.com/weaveworks/eksctl/releases/latest/download/eksctl_$(uname -s)_amd64.tar.gz" | tar xz -C /tmp
Move the extracted binary to /usr/local/bin.
sudo mv /tmp/eksctl /usr/local/bin
Test that your installation was successful with the following command. You must have eksctl version 0.167.0 or later.
eksctl version
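If you want to check that requirement in a script, a small sketch like the following compares a version string against the 0.167.0 minimum using `sort -V`. The function name is illustrative, and it assumes `eksctl version` prints a bare version string.

```shell
# Minimal sketch: return success when the given version is >= 0.167.0.
# Relies on GNU sort's -V (version sort), available in AWS CloudShell.
eksctl_version_ok() {
  min="0.167.0"
  [ "$(printf '%s\n%s\n' "$min" "$1" | sort -V | head -n 1)" = "$min" ]
}

# Usage against a real installation:
#   eksctl_version_ok "$(eksctl version)" && echo "eksctl is new enough"
```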
Install Helm, which is used to deploy charts on Amazon EKS.
curl https://raw.githubusercontent.com/helm/helm/master/scripts/get-helm-3 > get_helm.sh
chmod 700 get_helm.sh
./get_helm.sh
Set up an Amazon EKS cluster
Create an EKS cluster - Run the following command to create an EKS cluster and nodes.
eksctl create cluster --name demo-cluster --region us-east-1 --with-oidc --ssh-access --ssh-public-key demo-sanchit --instance-types=m5.xlarge --managed
Note:
- Replace demo-cluster and demo-sanchit with your own cluster name and key pair name.
- Replace us-east-1 with the Region where you want to create your cluster.
View and validate resources - Run the following command to view your cluster nodes.
kubectl get nodes -o wide
Steps to configure Karpenter for your EKS cluster
Download and unzip karpenter_files.zip, then upload the pod template files to the pod-template/ prefix in your S3 bucket.
export ACCOUNTID="${ACCOUNTID:-$(aws sts get-caller-identity --query Account --output text)}"
export AWS_REGION="${AWS_REGION:-$(curl -s 169.254.169.254/latest/dynamic/instance-identity/document | jq -r '.region')}"
export S3BUCKET="athena-spark-datastore"
export EKSCLUSTER_NAME="demo-cluster"
curl -O https://raw.githubusercontent.com/sanchitdilipjain/sanchitdilipjain.github.io/main/resources/karpenter/karpenter_files.zip
unzip karpenter_files.zip
aws s3 cp karpenter-driver-pod-template.yaml s3://${S3BUCKET}/pod-template/
aws s3 cp karpenter-executor-pod-template.yaml s3://${S3BUCKET}/pod-template/
Run the karpenter-setup.sh script that will install all the necessary Karpenter components.
./karpenter-setup.sh
Note: If the bucket name displayed within the square brackets is correct, just press ENTER to continue.
Verify that the Karpenter pods are installed in the karpenter namespace.
kubectl get pods -n karpenter
Submit a Spark job using the emr6.5-tpcds-karpenter.sh script.
./emr6.5-tpcds-karpenter.sh 2 2
Monitor the progress by running the following command.
# This command shows the pods being generated. The first pod is the job runner,
# the second is the Spark driver pod, and then about 50 executor pods are scheduled.
watch -n1 "kubectl get pod -n emr-karpenter"
If you want to view how Karpenter provisioned the nodes to meet the workload requirements and removed the nodes once the job was completed, you can do the following:
# List the Karpenter pods:
kubectl get pods -n karpenter
# Choose one of the pods and pull its logs:
kubectl logs karpenter-7d4f4d7675-ktsdv -n karpenter
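Those logs can be noisy, so it can help to filter them down to scaling activity. The helper below is a minimal sketch; the keyword list is an assumption about Karpenter's log wording, so adjust it to match what your Karpenter version actually emits.

```shell
# Hypothetical helper: keep only log lines that look like scaling events.
# The keywords are assumptions about Karpenter's log messages, not a contract.
karpenter_scaling_events() {
  grep -Ei 'provision|launch|terminat|delet'
}

# Usage against a live cluster (your pod name will differ):
#   kubectl logs karpenter-7d4f4d7675-ktsdv -n karpenter | karpenter_scaling_events
```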
Resources
- Refer to the official Amazon EMR on EKS documentation for the latest details.