Spark on AWS EMR
Key Links
Create an EMR Cluster with Spark using the AWS Console
The following procedure creates a cluster with Spark installed.
- Open the Amazon EMR console at https://console.aws.amazon.com/elasticmapreduce/.
- Choose Create cluster to use Quick Create.
- For the Software Configuration field, choose Amazon Release Version emr-5.0.0 or later.
- In the Select Applications field, choose either All Applications or Spark.
- Select other options as necessary and then choose Create cluster.
Create an EMR Cluster with Spark using the AWS CLI
Simple cluster:
aws emr create-cluster --name "Spark cluster" --release-label emr-5.0.0 --applications Name=Spark \
--ec2-attributes KeyName=myKey --instance-type m3.xlarge --instance-count 3 --use-default-roles
Note: For Windows, replace the above Linux line continuation character (\) with the caret (^).
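Scripts usually need the new cluster's ID for later commands. The following is a minimal sketch, assuming the same options as the simple cluster command above: `--query ClusterId --output text` prints only the ID from the create-cluster response. The function name `create_spark_cluster` is a hypothetical helper and `myKey` is a placeholder key pair name.

```shell
# Create the cluster and print only its ID.
# create_spark_cluster is a hypothetical helper name; myKey is a placeholder key pair.
create_spark_cluster() {
  aws emr create-cluster --name "Spark cluster" --release-label emr-5.0.0 \
    --applications Name=Spark --ec2-attributes KeyName=myKey \
    --instance-type m3.xlarge --instance-count 3 --use-default-roles \
    --query ClusterId --output text
}

# Usage (needs AWS credentials and the default EMR roles):
#   cluster_id=$(create_spark_cluster)
#   aws emr wait cluster-running --cluster-id "$cluster_id"
```

The `aws emr wait cluster-running` call in the usage comment blocks until the cluster is up, which is convenient before trying to connect to it.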
When using a config file:
aws emr create-cluster --release-label emr-5.0.0 --applications Name=Spark \
--instance-type m3.xlarge --instance-count 3 --configurations https://s3.amazonaws.com/mybucket/myfolder/myConfig.json
Sample myConfig.json:
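A minimal sketch of such a file, written from the shell. The `spark` classification and its `maximizeResourceAllocation` property are chosen here purely for illustration; they are an assumption, not necessarily the sample this heading refers to.

```shell
# Write an illustrative myConfig.json; the property shown is an assumed example.
cat > myConfig.json <<'EOF'
[
  {
    "Classification": "spark",
    "Properties": {
      "maximizeResourceAllocation": "true"
    }
  }
]
EOF
```

The file can then be uploaded to the S3 location used in the command above, or passed directly from local disk with `--configurations file://./myConfig.json`.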
Using Spot instances:
aws emr create-cluster --name "Spot cluster" --release-label emr-5.0.0 --applications Name=Spark \
--use-default-roles --ec2-attributes KeyName=myKey \
--instance-groups InstanceGroupType=MASTER,InstanceType=m3.xlarge,InstanceCount=1,BidPrice=0.25 \
InstanceGroupType=CORE,BidPrice=0.03,InstanceType=m3.xlarge,InstanceCount=2
# InstanceGroupType=TASK,BidPrice=0.10,InstanceType=m3.xlarge,InstanceCount=3
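In the `--instance-groups` shorthand, a group is requested as Spot only when `BidPrice` is present; leaving it out requests On-Demand instances. The sketch below shows that difference with a small formatter. `instance_group` is a hypothetical helper, not part of the AWS CLI, and the bid prices are just the placeholder values from the command above.

```shell
# instance_group TYPE INSTANCE_TYPE COUNT [BID_PRICE]
# Prints one --instance-groups argument; BidPrice is added only when a bid is given.
# instance_group is a hypothetical helper, not an AWS CLI command.
instance_group() {
  spec="InstanceGroupType=$1,InstanceType=$2,InstanceCount=$3"
  if [ -n "${4:-}" ]; then
    spec="$spec,BidPrice=$4"
  fi
  printf '%s\n' "$spec"
}

instance_group MASTER m3.xlarge 1        # no bid: On-Demand master
instance_group CORE m3.xlarge 2 0.03     # bid given: Spot core nodes
instance_group TASK m3.xlarge 3 0.10     # bid given: Spot task nodes
```

Keeping the master On-Demand, as in the first call, is a design suggestion rather than what the command above does: a cluster does not survive losing its master node, so some setups reserve Spot for core or task groups only.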
In Java:
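// Start a Spark cluster on EMR from Java (AWS SDK for Java 1.x).
// Assumes `credentials` is an AWSCredentials instance created elsewhere.
import java.util.ArrayList;
import java.util.List;
import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduceClient;
import com.amazonaws.services.elasticmapreduce.model.Application;
import com.amazonaws.services.elasticmapreduce.model.JobFlowInstancesConfig;
import com.amazonaws.services.elasticmapreduce.model.RunJobFlowRequest;
import com.amazonaws.services.elasticmapreduce.model.RunJobFlowResult;

AmazonElasticMapReduceClient emr = new AmazonElasticMapReduceClient(credentials);

// Applications to install on the cluster
Application sparkApp = new Application().withName("Spark");
List<Application> myApps = new ArrayList<>();
myApps.add(sparkApp);

RunJobFlowRequest request = new RunJobFlowRequest()
    .withName("Spark Cluster")
    .withApplications(myApps)
    .withReleaseLabel("emr-5.0.0") // same release as the CLI examples above
    .withInstances(new JobFlowInstancesConfig()
        .withEc2KeyName("myKeyName")
        .withInstanceCount(1)
        .withKeepJobFlowAliveWhenNoSteps(true)
        .withMasterInstanceType("m3.xlarge")
        .withSlaveInstanceType("m3.xlarge")
    );
RunJobFlowResult result = emr.runJobFlow(request);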
Connect to the Master Node using SSH
To connect to the master node using SSH, you need the public DNS name of the master node and the private key of your Amazon EC2 key pair. The key pair is specified when you launch the cluster (for example, KeyName=myKey in the CLI commands above).
- To retrieve the cluster identifier, type the following command:
aws emr list-clusters
The output lists your clusters including the cluster IDs. Note the cluster ID for the cluster to which you are connecting.
"Status": {
    "Timeline": {
        "ReadyDateTime": 1408040782.374,
        "CreationDateTime": 1408040501.213
    },
    "State": "WAITING",
    "StateChangeReason": {
        "Message": "Waiting after step completed"
    }
},
"NormalizedInstanceHours": 4,
"Id": "j-2AL4XXXXXX5T9",
"Name": "My cluster"
- To list the cluster instances including the master public DNS name for the cluster, type one of the following commands. Replace j-2AL4XXXXXX5T9 with the cluster ID returned by the previous command.
aws emr list-instances --cluster-id j-2AL4XXXXXX5T9
Or:
aws emr describe-cluster --cluster-id j-2AL4XXXXXX5T9
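Once the cluster ID is known, the lookup and the SSH command can be combined. The following is a minimal sketch under a few assumptions: `master_dns` and `ssh_command` are hypothetical helper names, the key file path is a placeholder, and `hadoop` is the login user that EMR sets up on the master node.

```shell
# Print the master node's public DNS name for a cluster ID.
# master_dns is a hypothetical helper; it relies on the CLI's --query option.
master_dns() {
  aws emr describe-cluster --cluster-id "$1" \
    --query 'Cluster.MasterPublicDnsName' --output text
}

# Print (not run) the SSH command for a key file and a cluster ID.
ssh_command() {
  printf 'ssh -i %s hadoop@%s\n' "$1" "$(master_dns "$2")"
}

# Usage, with a placeholder key path:
#   ssh_command ~/myKey.pem j-2AL4XXXXXX5T9
```

Printing the command instead of running it keeps the helper safe to call in scripts; pipe the result to `sh` or copy it into a terminal to open the session.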
View the Web Interfaces Hosted on Amazon EMR Clusters
- YARN ResourceManager: http://master-public-dns-name:8088
- YARN NodeManager: http://slave-public-dns-name:8042
- Hadoop HDFS NameNode: http://master-public-dns-name:50070
- Hadoop HDFS DataNode: http://slave-public-dns-name:50075
- Spark HistoryServer: http://master-public-dns-name:18080
- Zeppelin: http://master-public-dns-name:8890
- Hue: http://master-public-dns-name:8888
- Ganglia: http://master-public-dns-name/ganglia
- HBase UI: http://master-public-dns-name:16010
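These ports are usually not open to the internet in the cluster's security groups, so one common way to reach an interface is SSH local port forwarding through the master node. The sketch below only prints the command. `tunnel_command` is a hypothetical helper, the key path is a placeholder, and 8157 is an arbitrary local port assumed to be free.

```shell
# tunnel_command KEY_FILE MASTER_DNS REMOTE_PORT [LOCAL_PORT]
# Prints an ssh command that forwards a local port to a web UI on the master node.
# tunnel_command is a hypothetical helper; 8157 is an arbitrary default local port.
tunnel_command() {
  printf 'ssh -i %s -N -L %s:%s:%s hadoop@%s\n' "$1" "${4:-8157}" "$2" "$3" "$2"
}

# Example: forward local port 8157 to the YARN ResourceManager on port 8088.
tunnel_command ~/myKey.pem master-public-dns-name 8088
```

While the printed command is running, the interface should be reachable in a browser at localhost on the chosen local port.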