How do you run Spark jobs with Amazon EMR on Amazon EKS?💡
Introduction
Amazon EMR on EKS provides a deployment option for Amazon EMR that allows you to run open-source big data frameworks on Amazon Elastic Kubernetes Service (Amazon EKS).
With this deployment option, you can focus on running analytics workloads while Amazon EMR on EKS builds, configures, and manages containers for open-source applications.
The following diagram shows the two different deployment models for Amazon EMR.
Amazon EMR on EKS loosely couples applications to the infrastructure that they run on. With this loose coupling of services, you can run multiple, securely isolated jobs simultaneously. You can also benchmark the same job with different compute backends or spread your job across multiple Availability Zones to improve availability.
The following diagram illustrates how Amazon EMR on EKS works with other AWS services.
How the components work together?
The following steps and diagram illustrate the Amazon EMR on EKS workflow:
- Use an existing Amazon EKS cluster or create one using the eksctl command line utility or Amazon EKS console.
- Create a virtual cluster by registering Amazon EMR with a namespace on an EKS cluster.
- Submit your job to the virtual cluster using the AWS CLI or SDK.
Demo
Setting up Amazon EMR on EKS
Go to AWS CloudShell by clicking here: link
Install eksctl
Download and extract the latest release of eksctl with the following command.
curl --silent --location "https://github.com/weaveworks/eksctl/releases/latest/download/eksctl_$(uname -s)_amd64.tar.gz" | tar xz -C /tmp
Move the extracted binary to /usr/local/bin.
sudo mv /tmp/eksctl /usr/local/bin
Test that your installation was successful with the following command.You must have eksctl 0.167.0 version or later.
eksctl version
Set up an Amazon EKS cluster
Create an EKS cluster - Run the following command to create an EKS cluster and nodes.
eksctl create cluster --name demo-cluster --region us-east-1 --with-oidc --ssh-access --ssh-public-key demo-sanchit --instance-types=m5.xlarge --managed
Note:
- Replace my-cluster and myKeyPair with your own cluster name and key pair name.
- Replace us-west-2 with the Region where you want to create your cluster.
View and validate resources - Run the following command to view your cluster nodes.
kubectl get nodes -o wide
Enable cluster access for Amazon EMR on EKS
Create a Kubernetes namespace by running the following
kubectl create ns emr-on-eks-demo
Allow Amazon EMR on EKS access to a specific namespace in your cluster by taking the following actions
eksctl create iamidentitymapping --cluster demo-cluster --namespace emr-on-eks-demo --service-name "emr-containers"
Note: - Replace demo-cluster with the name of your Amazon EKS cluster and replace emr-on-eks-demo with the Kubernetes namespace created to run Amazon EMR workloads.
Enable IAM Roles for Service Accounts (IRSA) on the EKS cluster
AWS CLI command to retrieve OpenID Connect issuer URL
aws eks describe-cluster --name demo-cluster --query "cluster.identity.oidc.issuer" --output text
Create an IAM OIDC identity provider for your cluster with eksctl
eksctl utils associate-iam-oidc-provider --cluster demo-cluster --approve
Note: - Replace demo-cluster with the name of your Amazon EKS cluster.
Create a job execution role
To run workloads on Amazon EMR on EKS, you need to create an IAM role.
The following policy for the job execution role allows access to resource targets, Amazon S3, and CloudWatch. These permissions are necessary to monitor jobs and access logs.
Creating an IAM role with trust policy
cat <<EoF > ./emr-trust-policy.json { "Version": "2012-10-17", "Statement": [ { "Effect": "Allow", "Principal": { "Service": "elasticmapreduce.amazonaws.com" }, "Action": "sts:AssumeRole" } ] } EoF aws iam create-role --role-name EMRContainers-JobExecutionRole --assume-role-policy-document file://./emr-trust-policy.json
Attach role to above creating IAM role
cat <<EoF > ./EMRContainers-JobExecutionRole.json { "Version": "2012-10-17", "Statement": [ { "Effect": "Allow", "Action": [ "s3:PutObject", "s3:GetObject", "s3:ListBucket" ], "Resource": "*" }, { "Effect": "Allow", "Action": [ "logs:PutLogEvents", "logs:CreateLogStream", "logs:DescribeLogGroups", "logs:DescribeLogStreams" ], "Resource": [ "arn:aws:logs:*:*:*" ] } ] } EoF aws iam put-role-policy --role-name EMRContainers-JobExecutionRole --policy-name EMR-Containers-Job-Execution --policy-document file://./EMRContainers-JobExecutionRole.json
Update the trust policy of the job execution role
When you use IAM Roles for Service Accounts (IRSA) to run jobs on a Kubernetes namespace, an administrator must create a trust relationship between the job execution role and the identity of the EMR managed service account
Run the following command to update the trust policy.
aws emr-containers update-role-trust-policy --cluster-name demo-cluster --namespace emr-on-eks-demo --role-name EMRContainers-JobExecutionRole
Register the Amazon EKS cluster with Amazon EMR
Use the following command to create a virtual cluster with a name of your choice for the Amazon EKS cluster and namespace that you set up in previous steps.
aws emr-containers create-virtual-cluster --name emr-on-eks-demo --container-provider '{ "id": "demo-cluster", "type": "EKS", "info": { "eksInfo": { "namespace": "emr-on-eks-demo" } } }'
Submit a Spark job run with StartJobRun
Now let’s run a sample workload using one of the inbuilt example scripts that calculates the value of pi.
First get the virtual EMR clusters id and arn of the role that EMR uses for job execution.
export VIRTUAL_CLUSTER_ID=$(aws emr-containers list-virtual-clusters --query "virtualClusters[?state=='RUNNING'].id" --output text) export EMR_ROLE_ARN=$(aws iam get-role --role-name EMRContainers-JobExecutionRole --query Role.Arn --output text)
Let’s start a sample spark job.
aws emr-containers start-job-run --virtual-cluster-id=$VIRTUAL_CLUSTER_ID --name=pi-2 --execution-role-arn=$EMR_ROLE_ARN \ --release-label=emr-6.15.0-latest \ --job-driver='{ "sparkSubmitJobDriver": { "entryPoint": "local:///usr/lib/spark/examples/src/main/python/pi.py", "sparkSubmitParameters": "--conf spark.executor.instances=1 --conf spark.executor.memory=2G --conf spark.executor.cores=1 --conf spark.driver.cores=1" } }'
Resources
- Visit this page to find the latest documentation.