An Interesting Amazon Interview Question

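A friend of mine applied for a cloud computing position at Amazon. At the end of the second-round interview, the interviewer handed out a problem with two days to work on it before the third round, so in effect it was the third-round interview question. Because it involved Hadoop and cloud computing, my friend asked me for help. It took me about a day to work out an answer. The problem is cleverly designed, and solving it felt like a Sherlock Holmes case, full of twists and turns before everything finally fell into place. It left me impressed by how rigorous foreign companies are about hiring, so I wrote up my solution process in this post.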

The interview question

The question was originally posted online. In case that link stops working, the full text is copied below:

0x1 Customer question

Hi Support Team.

We met with problem when creating EMR cluster with emrfs-site in this China AWS account.
We use the following command to launch the EMR cluster.

$ aws emr create-cluster --applications Name=Hadoop --ec2-attributes '{"KeyName":"emr-hire","InstanceProfile":"EMR_EC2_DefaultRole","SubnetId":"subnet-8265d4e6","EmrManagedSlaveSecurityGroup":"sg-b96a61dd","EmrManagedMasterSecurityGroup":"sg-646d6600"}' --release-label emr-5.12.0 --log-uri 's3n://aws-logs-368436158483-cn-north-1/elasticmapreduce/' --instance-groups '[{"InstanceCount":1,"EbsConfiguration":{"EbsBlockDeviceConfigs":[{"VolumeSpecification":{"SizeInGB":32,"VolumeType":"gp2"},"VolumesPerInstance":1}]},"InstanceGroupType":"MASTER","InstanceType":"m4.large","Name":"Master - 1"},{"InstanceCount":1,"EbsConfiguration":{"EbsBlockDeviceConfigs":[{"VolumeSpecification":{"SizeInGB":32,"VolumeType":"gp2"},"VolumesPerInstance":1}]},"InstanceGroupType":"CORE","InstanceType":"m4.large","Name":"Core - 2"}]' --configurations '[{"Classification":"emrfs-site","Properties":{"fs.s3.cse.enabled":"true","fs.s3.cse.encryptionMaterialsProvider.uri":"s3://emrhiretest/emrhire.jar","fs.s3.customAWSCredentialsProvider":"com.liulishuo.data.LLSAWSCredentialsProvider"},"Configurations":[]}]' --auto-scaling-role EMR_AutoScaling_DefaultRole --ebs-root-volume-size 10 --service-role EMR_DefaultRole --enable-debugging --name 'emr-hire' --scale-down-behavior TERMINATE_AT_TASK_COMPLETION --region cn-north-1

{
"ClusterId": "j-2U472I5YFBLC5"
}

But when query the cluster status, it was failed to launch with error message:
“On the master instance (i-039aa2cc9a1167f99), bootstrap action 1 returned a non-zero return code”

We suspect the issue was caused by the related configure in the launch command, but we cannot fix the issue.

[{"Classification":"emrfs-site","Properties":{"fs.s3.cse.enabled":"true","fs.s3.cse.encryptionMaterialsProvider.uri":"s3://emrhiretest/emrhire.jar","fs.s3.customAWSCredentialsProvider":"com.liulishuo.data.LLSAWSCredentialsProvider"},"Configurations":[]}]

Please help us investigate the issue.

0x2 Resource You Got

working instance IP: 54.223.217.12
ssh key: https://s3.cn-north-1.amazonaws.com.cn/emrhiretest/emr-hire.pem
You can use the instance to read the error log and run EMR launch command to reproduce and fix the issue.
$ ssh -i emr-hire.pem ec2-user@54.223.217.12 to login

The related documents are start point for you reference, you may need read more than these to get the problem solved.

0x4 Questions

Thinking about these questions before interview, we may not ask exactly the same question, most of the time we will ask related questions to see how you deep dive the problem and your quick learning potential.

What Does Customer trying to do? What is a EMRFS? what does EMRFS credential provider do ?
What are the relationship between s3n s3a s3 and emrfs.
What does EMR configuration do and how it works
What does bootstrap action do and how it works, system use which user to run bootstrap script during boot, root, ec2-user or hadoop, how did you find out it.
What does script-runner do and how it works, if the script is store on S3, how did EMR run the script.
What is the different between bootstrap action and script-runner in EMR.
If customer configure both bootstrap action and EMR configuration which will run first.
Why does customer EMR cluster failed to launch, and how to fix the issue.

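Solution process

The question describes a scenario: a customer of Amazon's cloud launches a cluster with a single command, the cluster fails to start, and we have to use the logs the system produces to find the cause and help the customer fix it.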

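Step 1: Log in to the machine and reproduce the problem

Once I had a rough understanding of the question, I logged in to the machine it specifies and ran the same command again. The result matched the description: the cluster failed to launch. The command, however, contains the option --log-uri 's3n://aws-logs-368436158483-cn-north-1/elasticmapreduce/', which specifies where the logs are written, so I pulled those logs down to my local machine.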
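Pulling the logs down can be sketched roughly as follows. EMR stores each cluster's logs under the log URI in a folder named after the cluster ID, and the AWS CLI expects s3:// rather than s3n:// URIs. The helper name emr_log_prefix, the CLUSTER_ID variable and the local directory ./emr-logs are placeholders of my own, not part of the question.

```shell
# Build the S3 prefix that holds one cluster's logs.
# $1: the --log-uri value from the launch command, $2: the cluster ID.
emr_log_prefix() {
  # Rewrite s3n:// to s3:// and drop any trailing slashes before appending the ID.
  uri=$(printf '%s' "$1" | sed -e 's#^s3n://#s3://#' -e 's#/*$##')
  printf '%s/%s/\n' "$uri" "$2"
}

# Example (requires AWS credentials, so it is left commented out).
# CLUSTER_ID is whatever ID create-cluster returned for the reproduced run.
# aws s3 sync "$(emr_log_prefix 's3n://aws-logs-368436158483-cn-north-1/elasticmapreduce/' "$CLUSTER_ID")" ./emr-logs --region cn-north-1
```

Keeping the prefix construction in a small function makes it easy to point the same sync command at a different cluster ID each time the failure is reproduced.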

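Step 2: A first look at the logs

After pulling the logs down, I found four directories.

The first directory contains a hadoop-kms folder holding KMS-related logs, which made me suspect that the problem had something to do with authentication.
The second directory holds the master node's log.

It shows that a three-node cluster was started, with one master node and two worker nodes. The workers started first and then failed because the bootstrap program reported an error; both worker nodes failed.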
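The third directory is the most important one. It contains three log files.

The error I found in stderr shows that the copy failed while files were being distributed to the worker nodes.
The stdout log, meanwhile, shows that many higher-level packages are installed while the cluster starts up.

After the software installation, the same error is reported again.
The fourth directory contains some other logs.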
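When there are this many log files, a first pass can be automated by listing only the files that mention something error-like. The following is a minimal sketch under my own assumptions: the function name find_error_logs is made up, and the keywords are generic placeholders rather than the exact error text from these logs.

```shell
# List every file under directory $1 whose content mentions an error-like keyword.
# Files ending in .gz are decompressed on the fly; everything else is read as plain text.
find_error_logs() {
  find "$1" -type f | sort | while IFS= read -r f; do
    case "$f" in
      *.gz) reader="gzip -dc" ;;
      *)    reader="cat" ;;
    esac
    # $reader is intentionally unquoted so that "gzip -dc" splits into command and flag.
    if $reader "$f" | grep -Eiq 'error|exception|denied'; then
      printf '%s\n' "$f"
    fi
  done
}

# Example: find_error_logs ./emr-logs
```

The output is only a list of candidate files; each one still has to be opened and read to see what actually went wrong.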

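Step 3: Digging further into the logs

From this rough analysis, the error occurred while files were being distributed to the worker nodes, and there were the hadoop-kms logs from earlier, so I suspected an identity-based access control problem. I therefore kept searching the official documentation for a configuration option that would change permissions. Unfortunately, even after going through all of it, I found nothing.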

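Step 4: Back to the question

After reading the official documentation, I had formed my own understanding of several basic concepts in the question.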

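First, the cluster startup order: the cluster is started first, then the bootstrap program runs, and finally the higher-level software is installed. The error here was reported during the bootstrap stage.
Second, the relationship between EC2, S3 and EMR. EC2 is Amazon's virtual machine service, and its instances use a local file system. S3 is Amazon's object storage service, which a cluster can use as shared distributed storage in much the same way as HDFS. EMR is the managed Hadoop service that runs on EC2 instances, and EMRFS is its customized layer over S3, which makes it easier to build services on top, for example hadoop-kms.
Third, the cluster launch command wraps the user's configuration once more; in other words, the configuration the user changes is really the configuration of the software being installed.
Fourth, the jar file emrhire.jar mainly implements four classes: Encryption, ProviderType, CustomProviderClass and CustomProviderLocation. It can be thought of as implementing a key system whose main purpose is to encrypt the data transferred between S3 and EMR.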
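One quick way to check what a jar like this actually provides is to list its entries and look for the class named in fs.s3.customAWSCredentialsProvider. The sketch below assumes the jar has been downloaded locally as emrhire.jar and that unzip is installed; the helper name class_to_jar_entry is my own.

```shell
# Convert a fully qualified Java class name into the path it would have inside a jar.
class_to_jar_entry() {
  printf '%s.class\n' "$(printf '%s' "$1" | tr '.' '/')"
}

# Example (needs the jar locally, so it is left commented out):
# unzip -l emrhire.jar | grep -F "$(class_to_jar_entry com.liulishuo.data.LLSAWSCredentialsProvider)"
```

If the grep prints nothing, the configured class is not in that jar, which would be worth knowing before looking at anything else.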

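Step 5: A different approach

Until now I had been trying to grant permissions through options on the launch command, but no such option exists. Then it occurred to me that the bootstrap program runs before the software is configured: even if the software configuration stage lacks the permission, the stage where the cluster is created and initialized may well have it. So the work that fails in the third stage of startup (software configuration) could be moved up into the second stage (bootstrap) to solve the problem.
So I wrote a script, copy_jar_file.sh, with the following content: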

sudo aws s3 cp s3://emrhiretest/emrhire.jar /usr/share/aws/emr/emrfs/auxlib/

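I wanted to run this script as part of the bootstrap program, but it turned out that I had no permission at all to upload a local script file to S3.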
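For reference, a script like copy_jar_file.sh could be made a little more defensive before being used as a bootstrap action. The sketch below is my own variation, not the script that ended up being used. The DRY_RUN switch and the function names are hypothetical additions so the commands can be previewed without AWS access, and creating the target directory first is only a precaution, since I have not confirmed whether it already exists at bootstrap time.

```shell
#!/bin/sh
# Sketch of a more defensive copy_jar_file.sh.
# DRY_RUN is a hypothetical switch: when set to 1, commands are printed, not executed.

run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

install_jar() {
  # $1: S3 URI of the jar, $2: directory on the node to copy it into.
  # Stop at the first failing command so the bootstrap action returns non-zero.
  run sudo mkdir -p "$2" &&
  run sudo aws s3 cp "$1" "$2/"
}

# On a real node the script would end with:
# install_jar s3://emrhiretest/emrhire.jar /usr/share/aws/emr/emrfs/auxlib
```

Setting DRY_RUN=1 before calling install_jar prints the two commands that would run, which is a cheap way to check paths before spending several minutes on a cluster launch.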

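Step 6: Problem solved

Next I looked at what was in the S3 directory, and the result was a pleasant surprise.

The question setter had already prepared a ready-made script, bootregion.sh, which had been sitting there waiting all along.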

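So I changed the launch command as follows, adding a --bootstrap-actions option and removing fs.s3.cse.encryptionMaterialsProvider.uri from the emrfs-site properties: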

aws emr create-cluster --applications Name=Hadoop --bootstrap-actions '[{"Path":"s3://emrhiretest/emrhire.jar"}]' --ec2-attributes '{"KeyName":"emr-hire","InstanceProfile":"EMR_EC2_DefaultRole","SubnetId":"subnet-8265d4e6","EmrManagedSlaveSecurityGroup":"sg-b96a61dd","EmrManagedMasterSecurityGroup":"sg-646d6600"}' --release-label emr-5.12.0 --log-uri 's3n://aws-logs-368436158483-cn-north-1/elasticmapreduce/' --instance-groups '[{"InstanceCount":1,"EbsConfiguration":{"EbsBlockDeviceConfigs":[{"VolumeSpecification":{"SizeInGB":32,"VolumeType":"gp2"},"VolumesPerInstance":1}]},"InstanceGroupType":"MASTER","InstanceType":"m4.large","Name":"Master - 1"},{"InstanceCount":1,"EbsConfiguration":{"EbsBlockDeviceConfigs":[{"VolumeSpecification":{"SizeInGB":32,"VolumeType":"gp2"},"VolumesPerInstance":1}]},"InstanceGroupType":"CORE","InstanceType":"m4.large","Name":"Core - 2"}]' --configurations '[{"Classification":"emrfs-site","Properties":{"fs.s3.cse.enabled":"true","fs.s3.customAWSCredentialsProvider":"com.liulishuo.data.LLSAWSCredentialsProvider"},"Configurations":[]}]' --auto-scaling-role EMR_AutoScaling_DefaultRole --ebs-root-volume-size 10 --service-role EMR_DefaultRole --enable-debugging --name 'emr-hire' --scale-down-behavior TERMINATE_AT_TASK_COMPLETION --region cn-north-1

The cluster was created successfully and the problem was solved.
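Whether a launch really succeeded can be confirmed by querying the cluster state instead of waiting for an error message. The following is a small sketch: the function name classify_cluster_state and the three outcome labels are my own, while the state names are the ones EMR reports for clusters.

```shell
# Map an EMR cluster state to a coarse outcome: ok, stopped, pending or unknown.
classify_cluster_state() {
  case "$1" in
    RUNNING|WAITING) echo ok ;;
    TERMINATING|TERMINATED|TERMINATED_WITH_ERRORS) echo stopped ;;
    STARTING|BOOTSTRAPPING) echo pending ;;
    *) echo unknown ;;
  esac
}

# Example (requires AWS credentials; CLUSTER_ID is whatever create-cluster returned):
# state=$(aws emr describe-cluster --cluster-id "$CLUSTER_ID" --region cn-north-1 \
#           --query 'Cluster.Status.State' --output text)
# classify_cluster_state "$state"
```

A cluster that fails in a bootstrap action ends up in TERMINATED_WITH_ERRORS, so polling until the result is no longer pending distinguishes the original failure from the fixed launch.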

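Summary

This interview question really tests how deeply and fluently a candidate understands Hadoop clusters, big data and distributed cloud computing in practice, and it examines overall ability, including locating and solving problems. Without hands-on experience of building clusters, it is hard to know where to start. I would like to thank 田兴邦 for his support while I worked through it. It also goes to show that the experience of building Hadoop and Spark clusters during my internship at Baidu the summer before last was very useful.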

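[Copyright notice]
This article was first published on 戚名钰's blog. Reposting is welcome, but the attribution to 戚名钰 (including the link) must be kept. For commercial cooperation or licensing enquiries, please leave me a message: qimingyu.security@foxmail.com
You are welcome to follow my WeChat official account: 科技锐新

Permanent link to this article: http://qimingyu.github.io/2018/04/03/一道有趣的亚马逊面试题/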