Spark on AWS EMR

Create an EMR Cluster with Spark using the AWS Console

The following procedure creates a cluster with Spark installed.

1. Open the Amazon EMR console at https://console.aws.amazon.com/elasticmapreduce/.
2. Choose Create cluster to use Quick Create.
3. For the Software Configuration field, choose Amazon Release Version emr-5.0.0 or later.
4. In the Select Applications field, choose either All Applications or Spark.
5. Select other options as necessary and then choose Create cluster.

Create an EMR Cluster with Spark using the AWS CLI

Simple cluster:

aws emr create-cluster --name "Spark cluster" --release-label emr-5.0.0 --applications Name=Spark \
--ec2-attributes KeyName=myKey --instance-type m3.xlarge --instance-count 3 --use-default-roles

Note: For Windows, replace the Linux line continuation character (\) in the command above with the caret (^).

When using a config file:

aws emr create-cluster --release-label emr-5.0.0 --applications Name=Spark \
--instance-type m3.xlarge --instance-count 3 --configurations https://s3.amazonaws.com/mybucket/myfolder/myConfig.json

Sample myConfig.json:

[
  {
    "Classification": "spark",
    "Properties": {
      "maximizeResourceAllocation": "true"
    }
  }
]
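
The configuration can also be kept in a local file instead of an S3 URL, because the AWS CLI accepts a file:// path for --configurations. The following is a minimal sketch under that assumption; the function names write_spark_config and create_cluster_with_config are hypothetical, and the cluster options are copied from the simple cluster example.

```shell
# Hypothetical helper: write the sample Spark configuration to the file given as $1.
write_spark_config() {
  cat > "$1" <<'EOF'
[
  {
    "Classification": "spark",
    "Properties": {
      "maximizeResourceAllocation": "true"
    }
  }
]
EOF
}

# Hypothetical helper: launch a cluster that reads its configuration
# from the local file given as $1 (passed to the CLI as file://<path>).
create_cluster_with_config() {
  aws emr create-cluster --name "Spark cluster" --release-label emr-5.0.0 \
    --applications Name=Spark --ec2-attributes KeyName=myKey \
    --instance-type m3.xlarge --instance-count 3 --use-default-roles \
    --configurations "file://$1"
}

# Usage:
#   write_spark_config ./myConfig.json
#   create_cluster_with_config ./myConfig.json
```

Keeping the file local avoids the upload step while testing; the S3 form shown above is more convenient when several people share one configuration.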

Using Spot instances:

aws emr create-cluster --name "Spot cluster" --release-label emr-5.0.0 --applications Name=Spark \
--use-default-roles --ec2-attributes KeyName=myKey \
--instance-groups InstanceGroupType=MASTER,InstanceType=m3.xlarge,InstanceCount=1,BidPrice=0.25 \
InstanceGroupType=CORE,BidPrice=0.03,InstanceType=m3.xlarge,InstanceCount=2

# InstanceGroupType=TASK,BidPrice=0.10,InstanceType=m3.xlarge,InstanceCount=3
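
Each create-cluster call returns the ID of the new cluster. The following is a minimal sketch of capturing that ID and waiting until the cluster is up, assuming the AWS CLI is already configured with credentials and a default region; the function names launch_spark_cluster and wait_for_cluster are hypothetical, and myKey is the placeholder key name used above.

```shell
# Hypothetical helper: launch the simple Spark cluster and print only its cluster ID.
launch_spark_cluster() {
  aws emr create-cluster --name "Spark cluster" --release-label emr-5.0.0 \
    --applications Name=Spark --ec2-attributes KeyName=myKey \
    --instance-type m3.xlarge --instance-count 3 --use-default-roles \
    --query 'ClusterId' --output text
}

# Hypothetical helper: block until the cluster with ID $1 reports that it is running.
wait_for_cluster() {
  aws emr wait cluster-running --cluster-id "$1"
}

# Usage:
#   cluster_id=$(launch_spark_cluster)
#   wait_for_cluster "$cluster_id"
```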

In Java:

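// Start a Spark cluster on EMR from Java (AWS SDK for Java 1.x).
import java.util.ArrayList;
import java.util.List;
import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduceClient;
import com.amazonaws.services.elasticmapreduce.model.Application;
import com.amazonaws.services.elasticmapreduce.model.JobFlowInstancesConfig;
import com.amazonaws.services.elasticmapreduce.model.RunJobFlowRequest;
import com.amazonaws.services.elasticmapreduce.model.RunJobFlowResult;

// "credentials" is an AWSCredentials instance obtained elsewhere.
AmazonElasticMapReduceClient emr = new AmazonElasticMapReduceClient(credentials);

// withApplications expects a collection of Application objects.
Application sparkApp = new Application().withName("Spark");
List<Application> myApps = new ArrayList<Application>();
myApps.add(sparkApp);

RunJobFlowRequest request = new RunJobFlowRequest()
    .withName("Spark Cluster")
    .withApplications(myApps)
    // Release label assumed to match the CLI examples above.
    .withReleaseLabel("emr-5.0.0")
    .withInstances(new JobFlowInstancesConfig()
        .withEc2KeyName("myKeyName")
        .withInstanceCount(1)
        .withKeepJobFlowAliveWhenNoSteps(true)
        .withMasterInstanceType("m3.xlarge")
        .withSlaveInstanceType("m3.xlarge"));

RunJobFlowResult result = emr.runJobFlow(request);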

Connect to the Master Node using SSH

To connect to the master node using SSH, you need the public DNS name of the master node and the private key of your Amazon EC2 key pair. You specify the key pair when you launch the cluster.

To retrieve the cluster identifier, which you need in order to look up the public DNS name of the master node, type the following command:

aws emr list-clusters

The output lists your clusters including the cluster IDs. Note the cluster ID for the cluster to which you are connecting.

"Status": {
    "Timeline": {
        "ReadyDateTime": 1408040782.374,
        "CreationDateTime": 1408040501.213
    },
    "State": "WAITING",
    "StateChangeReason": {
        "Message": "Waiting after step completed"
    }
},
"NormalizedInstanceHours": 4,
"Id": "j-2AL4XXXXXX5T9",
"Name": "My cluster"

To list the cluster instances including the master public DNS name for the cluster, type one of the following commands. Replace j-2AL4XXXXXX5T9 with the cluster ID returned by the previous command.

aws emr list-instances --cluster-id j-2AL4XXXXXX5T9

Or:

aws emr describe-cluster --cluster-id j-2AL4XXXXXX5T9
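
The lookup and the SSH connection can be combined. The following is a minimal sketch, assuming the cluster was launched with an EC2 key pair and that the master node's security group allows inbound SSH from your machine; the function names master_public_dns and ssh_to_master are hypothetical. On EMR the SSH login user is hadoop.

```shell
# Hypothetical helper: print the master node's public DNS name for cluster ID $1.
master_public_dns() {
  aws emr describe-cluster --cluster-id "$1" \
    --query 'Cluster.MasterPublicDnsName' --output text
}

# Hypothetical helper: open an SSH session to the master node as the hadoop user.
# $1 = cluster ID, $2 = path to the private key file (.pem) of the EC2 key pair.
ssh_to_master() {
  ssh -i "$2" "hadoop@$(master_public_dns "$1")"
}

# Usage:
#   ssh_to_master j-2AL4XXXXXX5T9 ~/myKey.pem
```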

View the Web Interfaces Hosted on Amazon EMR Clusters

YARN ResourceManager: http://master-public-dns-name:8088
YARN NodeManager: http://slave-public-dns-name:8042
Hadoop HDFS NameNode: http://master-public-dns-name:50070
Hadoop HDFS DataNode: http://slave-public-dns-name:50075
Spark HistoryServer: http://master-public-dns-name:18080
Zeppelin: http://master-public-dns-name:8890
Hue: http://master-public-dns-name:8888
Ganglia: http://master-public-dns-name/ganglia
HBase UI: http://master-public-dns-name:16010
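
These ports may not be reachable directly from your machine, depending on the cluster's security groups. One common approach is SSH local port forwarding to the master node. The following is a minimal sketch, assuming SSH access to the master node already works; the function name tunnel_to_master is hypothetical.

```shell
# Hypothetical helper: forward a local port to the same port on the master node.
# $1 = private key file (.pem), $2 = master public DNS name, $3 = port (for example 8088).
# -N opens the tunnel without running a remote command; stop it with Ctrl+C.
tunnel_to_master() {
  ssh -i "$1" -N -L "$3:$2:$3" "hadoop@$2"
}

# Usage:
#   tunnel_to_master ~/myKey.pem master-public-dns-name 8088
#   then browse to http://localhost:8088 while the tunnel is open.
```

Interfaces hosted on the other nodes (NodeManager, DataNode) are not covered by this single forward and would need a separate tunnel or a proxy.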