Sanchit Dilip Jain/How do you run a Flink job with Amazon EMR on Amazon EKS?💡

Created Wed, 27 Dec 2023 12:00:00 +0000 Modified Sun, 12 May 2024 01:47:18 +0000
Introduction

  • Amazon EMR on EKS provides a deployment option for Amazon EMR that allows you to run open-source big data frameworks on Amazon Elastic Kubernetes Service (Amazon EKS).

  • With this deployment option, you can focus on running analytics workloads while Amazon EMR on EKS builds, configures, and manages containers for open-source applications.

  • The following diagram shows the two different deployment models for Amazon EMR.

  • Amazon EMR on EKS loosely couples applications to the infrastructure that they run on. With this loose coupling of services, you can run multiple, securely isolated jobs simultaneously. You can also benchmark the same job with different compute backends or spread your job across multiple Availability Zones to improve availability.

  • The following diagram illustrates how Amazon EMR on EKS works with other AWS services.

Overview of Apache Flink

  • Apache Flink is an open-source, unified stream processing and batch processing framework that was designed to process large amounts of data. It provides fast, reliable, and scalable data processing with fault tolerance and exactly-once semantics.

  • Some of the key features of Flink are:

    • Distributed Processing
    • Stream Processing and Batch Processing
    • Fault Tolerance
    • Exactly-once Semantics
    • Low Latency
    • Extensibility

Demo

  • Setting up Amazon EMR on EKS

    • Open AWS CloudShell from the AWS Management Console.

    • Install eksctl

      • Download and extract the latest release of eksctl with the following command.

        curl --silent --location "https://github.com/weaveworks/eksctl/releases/latest/download/eksctl_$(uname -s)_amd64.tar.gz" | tar xz -C /tmp
        
      • Move the extracted binary to /usr/local/bin.

        sudo mv /tmp/eksctl /usr/local/bin
        
      • Test that your installation was successful with the following command. You need eksctl version 0.167.0 or later.

        eksctl version
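
      • The version requirement above can also be checked in a script. The following is a minimal sketch, not part of the original walkthrough: version_ge is a hypothetical helper name, it relies on the -V (version sort) option of GNU sort, and it assumes that eksctl version prints a plain version string such as 0.167.0.

```shell
# Hypothetical helper: succeeds when version $1 is greater than or equal to $2.
# Relies on GNU sort's -V (version sort) option.
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Only run the check when eksctl is actually installed.
if command -v eksctl >/dev/null 2>&1; then
  installed="$(eksctl version)"
  if version_ge "$installed" "0.167.0"; then
    echo "eksctl $installed is new enough"
  else
    echo "eksctl $installed is older than 0.167.0; please upgrade" >&2
  fi
fi
```

        If eksctl prints extra build metadata after the version number on your system, the comparison may need adjusting.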
        

    • Set up an Amazon EKS cluster

      • Create an EKS cluster - Run the following command to create an EKS cluster and nodes.

        eksctl create cluster --name demo-cluster --region us-east-1 --with-oidc --ssh-access --ssh-public-key demo-sanchit --instance-types=m5.xlarge --managed
        

      Note:

      • Replace demo-cluster and demo-sanchit with your own cluster name and key pair name.
      • Replace us-east-1 with the Region where you want to create your cluster.
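
      • To keep the replaceable values in one place, the create command can be wrapped in a small function. This is only a sketch: the variable names CLUSTER_NAME, CLUSTER_REGION, and KEY_PAIR and the function name create_demo_cluster are assumptions made for illustration, and the flags are the same ones used in the command above.

```shell
# Assumed placeholder values; replace them with your own.
CLUSTER_NAME="demo-cluster"
CLUSTER_REGION="us-east-1"
KEY_PAIR="demo-sanchit"

# Hypothetical wrapper around the eksctl command from this section.
create_demo_cluster() {
  eksctl create cluster \
    --name "$CLUSTER_NAME" \
    --region "$CLUSTER_REGION" \
    --with-oidc \
    --ssh-access \
    --ssh-public-key "$KEY_PAIR" \
    --instance-types=m5.xlarge \
    --managed
}
```

        Defining the function does not create anything; run create_demo_cluster when you are ready to provision the cluster.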

      • View and validate resources - Run the following command to view your cluster nodes.

        kubectl get nodes -o wide
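
      • Rather than re-running the command above until the nodes appear, you can block until they are ready. A minimal sketch using kubectl wait; the helper name wait_for_nodes and the 300-second default timeout are assumptions.

```shell
# Hypothetical helper: blocks until every node reports the Ready condition,
# or fails once the timeout (default 300s) has elapsed.
wait_for_nodes() {
  kubectl wait --for=condition=Ready nodes --all --timeout="${1:-300s}"
}
```

        For example, wait_for_nodes 600s waits up to ten minutes.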
        
    • Enable cluster access for Amazon EMR on EKS

      • Create a Kubernetes namespace by running the following command.

        kubectl create ns emr-on-eks-demo
        
      • Allow Amazon EMR on EKS to access that namespace in your cluster by running the following command.

        eksctl create iamidentitymapping  --cluster demo-cluster --namespace emr-on-eks-demo --service-name "emr-containers"
        

      Note: Replace demo-cluster with the name of your Amazon EKS cluster, and replace emr-on-eks-demo with the Kubernetes namespace created to run Amazon EMR workloads.
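
      • To confirm that the mapping took effect, you can list the cluster's IAM identity mappings and the RBAC objects in the namespace. This is a sketch; show_emr_access is a hypothetical helper name, and its defaults are the cluster and namespace names used in this walkthrough. You would typically expect to see an entry for the Amazon EMR on EKS service-linked role and a role and role binding in the namespace.

```shell
# Hypothetical helper: lists the IAM identity mappings of a cluster and the
# RBAC roles and role bindings in the namespace used for Amazon EMR workloads.
# Arguments: cluster name, namespace (defaults match this walkthrough).
show_emr_access() {
  local cluster="${1:-demo-cluster}"
  local namespace="${2:-emr-on-eks-demo}"
  eksctl get iamidentitymapping --cluster "$cluster"
  kubectl get role,rolebinding -n "$namespace"
}
```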

    • Enable IAM Roles for Service Accounts (IRSA) on the EKS cluster

      • Retrieve the cluster's OpenID Connect (OIDC) issuer URL with the AWS CLI.

        aws eks describe-cluster --name demo-cluster --query "cluster.identity.oidc.issuer" --output text
        

      • Create an IAM OIDC identity provider for your cluster with eksctl

        eksctl utils associate-iam-oidc-provider --cluster demo-cluster --approve
        

      Note: Replace demo-cluster with the name of your Amazon EKS cluster.
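
      • Because the cluster in this walkthrough was created with --with-oidc, the IAM OIDC provider may already exist. The sketch below checks for it by matching the ID at the end of the issuer URL against the providers registered in the account. The helper names oidc_id_from_issuer and oidc_provider_exists are assumptions made for illustration.

```shell
# Hypothetical helper: extracts the trailing ID from an OIDC issuer URL of the
# form https://oidc.eks.<region>.amazonaws.com/id/<ID>.
oidc_id_from_issuer() {
  printf '%s\n' "${1##*/}"
}

# Hypothetical helper: succeeds if an IAM OIDC provider whose ARN contains the
# cluster's OIDC ID is already registered in the account.
# Argument: cluster name (defaults to demo-cluster).
oidc_provider_exists() {
  local issuer id
  issuer="$(aws eks describe-cluster --name "${1:-demo-cluster}" \
    --query "cluster.identity.oidc.issuer" --output text)"
  id="$(oidc_id_from_issuer "$issuer")"
  # An empty ID would match every provider, so treat it as "not found".
  [ -n "$id" ] || return 1
  aws iam list-open-id-connect-providers --output text | grep -q "$id"
}
```

        For example: oidc_provider_exists demo-cluster && echo "OIDC provider already associated".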

    • Create a job execution role

      • To run workloads on Amazon EMR on EKS, you need to create an IAM role.

      • The following policy for the job execution role allows access to resource targets, Amazon S3, and CloudWatch. These permissions are necessary to monitor jobs and access logs.

      • Create an IAM role with a trust policy.

         cat <<EoF > ./emr-trust-policy.json
         {
           "Version": "2012-10-17",
           "Statement": [
             {
               "Effect": "Allow",
               "Principal": {
                 "Service": "elasticmapreduce.amazonaws.com"
               },
               "Action": "sts:AssumeRole"
             }
           ]
         }
         EoF
        
         aws iam create-role --role-name EMRContainers-JobExecutionRole --assume-role-policy-document file://./emr-trust-policy.json
        

      • Attach the permissions policy to the IAM role created above.

         cat <<EoF > ./EMRContainers-JobExecutionRole.json
         {
             "Version": "2012-10-17",
             "Statement": [
                 {
                     "Effect": "Allow",
                     "Action": [
                         "s3:PutObject",
                         "s3:GetObject",
                         "s3:ListBucket"
                     ],
                     "Resource": "*"
                 },
                 {
                     "Effect": "Allow",
                     "Action": [
                         "logs:PutLogEvents",
                         "logs:CreateLogStream",
                         "logs:DescribeLogGroups",
                         "logs:DescribeLogStreams"
                     ],
                     "Resource": [
                         "arn:aws:logs:*:*:*"
                     ]
                 }
             ]
         }  
         EoF
        
         aws iam put-role-policy --role-name EMRContainers-JobExecutionRole --policy-name EMR-Containers-Job-Execution --policy-document file://./EMRContainers-JobExecutionRole.json
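
      • The policy above grants the S3 actions on every resource ("Resource": "*"), which is convenient for a demo. If your job only needs a single bucket, a narrower statement could look like the sketch below. The helper name scoped_s3_statement and the bucket name in the usage note are hypothetical; the actions are the same three used above.

```shell
# Hypothetical helper: prints an S3 policy statement limited to one bucket.
# s3:ListBucket applies to the bucket ARN; s3:GetObject and s3:PutObject
# apply to the objects inside the bucket.
# Argument: bucket name.
scoped_s3_statement() {
  local bucket="$1"
  printf '{\n'
  printf '  "Effect": "Allow",\n'
  printf '  "Action": ["s3:PutObject", "s3:GetObject", "s3:ListBucket"],\n'
  printf '  "Resource": [\n'
  printf '    "arn:aws:s3:::%s",\n' "$bucket"
  printf '    "arn:aws:s3:::%s/*"\n' "$bucket"
  printf '  ]\n'
  printf '}\n'
}
```

        For example, scoped_s3_statement my-flink-bucket prints a statement that could replace the first statement in EMRContainers-JobExecutionRole.json before you run put-role-policy.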
        

    • Update the trust policy of the job execution role

      • When you use IAM Roles for Service Accounts (IRSA) to run jobs in a Kubernetes namespace, an administrator must create a trust relationship between the job execution role and the identity of the EMR managed service account.

      • Run the following command to update the trust policy.

        aws emr-containers update-role-trust-policy --cluster-name demo-cluster --namespace emr-on-eks-demo --role-name EMRContainers-JobExecutionRole
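
      • To see what the update changed, you can print the role's trust policy. A minimal sketch; show_trust_policy is a hypothetical helper name, and it defaults to the role name used in this walkthrough. After the update you would expect an additional statement that allows sts:AssumeRoleWithWebIdentity for the cluster's OIDC provider.

```shell
# Hypothetical helper: prints the trust policy currently attached to a role.
# Argument: role name (defaults to the job execution role created earlier).
show_trust_policy() {
  aws iam get-role \
    --role-name "${1:-EMRContainers-JobExecutionRole}" \
    --query "Role.AssumeRolePolicyDocument" \
    --output json
}
```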
        

    • Register the Amazon EKS cluster with Amazon EMR

      • Use the following command to create a virtual cluster with a name of your choice for the Amazon EKS cluster and namespace that you set up in previous steps.

         aws emr-containers create-virtual-cluster --name emr-on-eks-demo --container-provider '{
             "id": "demo-cluster",
             "type": "EKS",
             "info": {
                 "eksInfo": {
                     "namespace": "emr-on-eks-demo"
                 }
             }
         }'
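
      • Later Amazon EMR on EKS commands usually refer to the virtual cluster by its ID rather than its name. The sketch below looks the ID up by name with a JMESPath filter; virtual_cluster_id is a hypothetical helper name, and it defaults to the virtual cluster name used above.

```shell
# Hypothetical helper: prints the ID of the RUNNING virtual cluster with the
# given name (defaults to emr-on-eks-demo).
virtual_cluster_id() {
  aws emr-containers list-virtual-clusters \
    --query "virtualClusters[?name=='${1:-emr-on-eks-demo}' && state=='RUNNING'].id" \
    --output text
}
```

        For example: VIRTUAL_CLUSTER_ID="$(virtual_cluster_id)".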
        

  • Install Flink Kubernetes operator for Amazon EMR on EKS

    • Install the Helm CLI, which is used to deploy the operator chart.

      curl https://raw.githubusercontent.com/helm/helm/master/scripts/get-helm-3 > get_helm.sh
      chmod 700 get_helm.sh
      ./get_helm.sh
      

    • Install Flink Kubernetes Operator using Helm chart

      export VERSION=6.15.0 # The Amazon EMR release version
      export NAMESPACE=emr-on-eks-demo
      
      helm install flink-kubernetes-operator-demo oci://public.ecr.aws/emr-on-eks/flink-kubernetes-operator --version $VERSION --namespace $NAMESPACE
      

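    • Install cert-manager (once per Amazon EKS cluster) to enable the operator's webhook component. The webhook depends on cert-manager, so if it is not already installed on the cluster, run this step before the Helm install of the operator.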

      kubectl apply -f https://github.com/cert-manager/cert-manager/releases/download/v1.12.0/cert-manager.yaml
      

    • Wait for the cert-manager and operator deployments to finish rolling out before you submit a job.
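
    • One way to confirm that the installs have finished is to wait for the Deployments to become Available. This is a sketch: wait_for_deployments is a hypothetical helper name and the 180-second default timeout is an assumption. The cert-manager manifest installs into the cert-manager namespace, and the operator was installed into emr-on-eks-demo above.

```shell
# Hypothetical helper: waits until every Deployment in a namespace reports the
# Available condition, or fails once the timeout (default 180s) has elapsed.
# Arguments: namespace, optional timeout.
wait_for_deployments() {
  kubectl wait --for=condition=Available deployment --all \
    -n "$1" --timeout="${2:-180s}"
}
```

      For example, run wait_for_deployments cert-manager followed by wait_for_deployments emr-on-eks-demo.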
      

  • Execute Sample Flink job with Managed Node Groups and Cluster Autoscaler

    • Submit a sample Flink job.

      kubectl apply -f https://raw.githubusercontent.com/sanchitdilipjain/sanchitdilipjain.github.io/main/resources/flink-job/flink-sample-job.yaml
      

    • Monitor the job status with the commands below. You should see new nodes launched by the Cluster Autoscaler, and YuniKorn schedules one JobManager pod and one TaskManager pod on them.

      kubectl get pods -n emr-on-eks-demo
      kubectl get deployment -n emr-on-eks-demo
      kubectl describe deployment demo-flink-job -n emr-on-eks-demo
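
    • The commands above show the pods and the JobManager Deployment. The operator also tracks the job through a FlinkDeployment custom resource, which can be queried directly. The sketch below assumes that the sample manifest defines a FlinkDeployment named demo-flink-job (inferred from the describe command above) and that the operator populates the status.jobStatus.state field; flink_job_state is a hypothetical helper name.

```shell
# Hypothetical helper: prints the job state recorded on a FlinkDeployment.
# Arguments: FlinkDeployment name, namespace (defaults match this walkthrough).
flink_job_state() {
  kubectl get flinkdeployment "${1:-demo-flink-job}" \
    -n "${2:-emr-on-eks-demo}" \
    -o jsonpath='{.status.jobStatus.state}'
}
```

      If the resource name in your manifest differs, kubectl get flinkdeployments -n emr-on-eks-demo lists the available names.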
      

Resources

  • Visit this page to find the latest documentation.