Sanchit Dilip Jain/How do you run Spark jobs with Amazon EMR on Amazon EKS?💡

How do you run Spark jobs with Amazon EMR on Amazon EKS?💡

Introduction

Amazon EMR on EKS provides a deployment option for Amazon EMR that allows you to run open-source big data frameworks on Amazon Elastic Kubernetes Service (Amazon EKS).
With this deployment option, you can focus on running analytics workloads while Amazon EMR on EKS builds, configures, and manages containers for open-source applications.
The following diagram shows the two different deployment models for Amazon EMR.
Amazon EMR on EKS loosely couples applications to the infrastructure that they run on. With this loose coupling of services, you can run multiple, securely isolated jobs simultaneously. You can also benchmark the same job with different compute backends or spread your job across multiple Availability Zones to improve availability.
The following diagram illustrates how Amazon EMR on EKS works with other AWS services.

How the components work together?

The following steps and diagram illustrate the Amazon EMR on EKS workflow:
- Use an existing Amazon EKS cluster or create one using the eksctl command line utility or Amazon EKS console.
- Create a virtual cluster by registering Amazon EMR with a namespace on an EKS cluster.
- Submit your job to the virtual cluster using the AWS CLI or SDK.

Demo

Setting up Amazon EMR on EKS

Go to AWS CloudShell by clicking here: link
Install eksctl
- Download and extract the latest release of eksctl with the following command.
```
curl --silent --location "https://github.com/weaveworks/eksctl/releases/latest/download/eksctl_$(uname -s)_amd64.tar.gz" | tar xz -C /tmp
```
- Move the extracted binary to /usr/local/bin.
```
sudo mv /tmp/eksctl /usr/local/bin
```
- Test that your installation was successful with the following command.You must have eksctl 0.167.0 version or later.
```
eksctl version
```
Set up an Amazon EKS cluster
- Create an EKS cluster - Run the following command to create an EKS cluster and nodes.
```
eksctl create cluster --name demo-cluster --region us-east-1 --with-oidc --ssh-access --ssh-public-key demo-sanchit --instance-types=m5.xlarge --managed
```
Note:
- Replace my-cluster and myKeyPair with your own cluster name and key pair name.
- Replace us-west-2 with the Region where you want to create your cluster.
- View and validate resources - Run the following command to view your cluster nodes.
```
kubectl get nodes -o wide
```
Enable cluster access for Amazon EMR on EKS
- Create a Kubernetes namespace by running the following
```
kubectl create ns emr-on-eks-demo
```
- Allow Amazon EMR on EKS access to a specific namespace in your cluster by taking the following actions
```
eksctl create iamidentitymapping  --cluster demo-cluster --namespace emr-on-eks-demo --service-name "emr-containers"
```
Note: - Replace demo-cluster with the name of your Amazon EKS cluster and replace emr-on-eks-demo with the Kubernetes namespace created to run Amazon EMR workloads.
Enable IAM Roles for Service Accounts (IRSA) on the EKS cluster
- AWS CLI command to retrieve OpenID Connect issuer URL
```
aws eks describe-cluster --name demo-cluster --query "cluster.identity.oidc.issuer" --output text
```
- Create an IAM OIDC identity provider for your cluster with eksctl
```
eksctl utils associate-iam-oidc-provider --cluster demo-cluster --approve
```
Note: - Replace demo-cluster with the name of your Amazon EKS cluster.

Create a job execution role

To run workloads on Amazon EMR on EKS, you need to create an IAM role.
The following policy for the job execution role allows access to resource targets, Amazon S3, and CloudWatch. These permissions are necessary to monitor jobs and access logs.

Creating an IAM role with trust policy

cat <<EoF > ./emr-trust-policy.json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Service": "elasticmapreduce.amazonaws.com"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}
EoF

aws iam create-role --role-name EMRContainers-JobExecutionRole --assume-role-policy-document file://./emr-trust-policy.json

Attach role to above creating IAM role

cat <<EoF > ./EMRContainers-JobExecutionRole.json
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:PutObject",
                "s3:GetObject",
                "s3:ListBucket"
            ],
            "Resource": "*"
        },
        {
            "Effect": "Allow",
            "Action": [
                "logs:PutLogEvents",
                "logs:CreateLogStream",
                "logs:DescribeLogGroups",
                "logs:DescribeLogStreams"
            ],
            "Resource": [
                "arn:aws:logs:*:*:*"
            ]
        }
    ]
}  
EoF

aws iam put-role-policy --role-name EMRContainers-JobExecutionRole --policy-name EMR-Containers-Job-Execution --policy-document file://./EMRContainers-JobExecutionRole.json

Update the trust policy of the job execution role
- When you use IAM Roles for Service Accounts (IRSA) to run jobs on a Kubernetes namespace, an administrator must create a trust relationship between the job execution role and the identity of the EMR managed service account
- Run the following command to update the trust policy.
```
aws emr-containers update-role-trust-policy --cluster-name demo-cluster --namespace emr-on-eks-demo --role-name EMRContainers-JobExecutionRole
```

Register the Amazon EKS cluster with Amazon EMR

Use the following command to create a virtual cluster with a name of your choice for the Amazon EKS cluster and namespace that you set up in previous steps.

 aws emr-containers create-virtual-cluster --name emr-on-eks-demo --container-provider '{
     "id": "demo-cluster",
     "type": "EKS",
     "info": {
         "eksInfo": {
             "namespace": "emr-on-eks-demo"
         }
     }
 }'

Submit a Spark job run with StartJobRun

Now let’s run a sample workload using one of the inbuilt example scripts that calculates the value of pi.

First get the virtual EMR clusters id and arn of the role that EMR uses for job execution.

export VIRTUAL_CLUSTER_ID=$(aws emr-containers list-virtual-clusters --query "virtualClusters[?state=='RUNNING'].id" --output text)
export EMR_ROLE_ARN=$(aws iam get-role --role-name EMRContainers-JobExecutionRole --query Role.Arn --output text)

Let’s start a sample spark job.

aws emr-containers start-job-run --virtual-cluster-id=$VIRTUAL_CLUSTER_ID --name=pi-2 --execution-role-arn=$EMR_ROLE_ARN \
 --release-label=emr-6.15.0-latest \
 --job-driver='{
    "sparkSubmitJobDriver": {
    "entryPoint": "local:///usr/lib/spark/examples/src/main/python/pi.py",
    "sparkSubmitParameters": "--conf spark.executor.instances=1 --conf spark.executor.memory=2G --conf spark.executor.cores=1 --conf spark.driver.cores=1"
        }
}'

Resources

Visit this page to find the latest documentation.