Simple YARN benchmark TestDFSIO fails

too*_*oom 2 hadoop exception hadoop-yarn

I have set up Hadoop on a two-node cluster. The first node, "namenode", runs the following daemons:

hadoop@namenode:~$ jps
2916 SecondaryNameNode
2692 NameNode
3159 NodeManager
5834 Jps
2771 DataNode
3076 ResourceManager

The second node, "datanode", runs the following daemons:

hadoop@datanode:~$ jps
2559 Jps
2087 DataNode
2198 NodeManager

On BOTH machines I added the following entries to /etc/hosts:

10.240.40.246 namenode
10.240.172.201 datanode

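These are the corresponding IPs, and I verified that I can ssh from each machine to the other. Now I want to test my Hadoop installation by running a sample MapReduce benchmark: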

hadoop@namenode:~$ hadoop jar /opt/hadoop-2.2.0/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-2.2.0-tests.jar TestDFSIO -write -nrFiles 20 -fileSize 10

But the job fails:

14/02/17 22:22:53 INFO fs.TestDFSIO: TestDFSIO.1.7
14/02/17 22:22:53 INFO fs.TestDFSIO: nrFiles = 20
14/02/17 22:22:53 INFO fs.TestDFSIO: nrBytes (MB) = 10.0
14/02/17 22:22:53 INFO fs.TestDFSIO: bufferSize = 1000000
14/02/17 22:22:53 INFO fs.TestDFSIO: baseDir = /benchmarks/TestDFSIO
14/02/17 22:22:54 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
14/02/17 22:22:55 INFO fs.TestDFSIO: creating control file: 10485760 bytes, 20 files
14/02/17 22:22:56 INFO fs.TestDFSIO: created control files for: 20 files
14/02/17 22:22:56 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
14/02/17 22:22:56 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
14/02/17 22:22:57 INFO mapred.FileInputFormat: Total input paths to process : 20
14/02/17 22:22:57 INFO mapreduce.JobSubmitter: number of splits:20
14/02/17 22:22:57 INFO Configuration.deprecation: user.name is deprecated. Instead, use mapreduce.job.user.name
14/02/17 22:22:57 INFO Configuration.deprecation: mapred.jar is deprecated. Instead, use mapreduce.job.jar
14/02/17 22:22:57 INFO Configuration.deprecation: mapred.reduce.tasks is deprecated. Instead, use mapreduce.job.reduces
14/02/17 22:22:57 INFO Configuration.deprecation: mapred.output.value.class is deprecated. Instead, use mapreduce.job.output.value.class
14/02/17 22:22:57 INFO Configuration.deprecation: mapred.job.name is deprecated. Instead, use mapreduce.job.name
14/02/17 22:22:57 INFO Configuration.deprecation: mapred.input.dir is deprecated. Instead, use mapreduce.input.fileinputformat.inputdir
14/02/17 22:22:57 INFO Configuration.deprecation: mapred.output.dir is deprecated. Instead, use mapreduce.output.fileoutputformat.outputdir
14/02/17 22:22:57 INFO Configuration.deprecation: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
14/02/17 22:22:57 INFO Configuration.deprecation: mapred.output.key.class is deprecated. Instead, use mapreduce.job.output.key.class
14/02/17 22:22:57 INFO Configuration.deprecation: mapred.working.dir is deprecated. Instead, use mapreduce.job.working.dir
14/02/17 22:22:58 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1392675199090_0001
14/02/17 22:22:59 INFO impl.YarnClientImpl: Submitted application application_1392675199090_0001 to ResourceManager at /0.0.0.0:8032
14/02/17 22:22:59 INFO mapreduce.Job: The url to track the job: http://namenode.c.forward-camera-473.internal:8088/proxy/application_1392675199090_0001/
14/02/17 22:22:59 INFO mapreduce.Job: Running job: job_1392675199090_0001
14/02/17 22:23:10 INFO mapreduce.Job: Job job_1392675199090_0001 running in uber mode : false
14/02/17 22:23:10 INFO mapreduce.Job:  map 0% reduce 0%
14/02/17 22:23:42 INFO mapreduce.Job:  map 20% reduce 0%
14/02/17 22:23:43 INFO mapreduce.Job:  map 30% reduce 0%
14/02/17 22:24:14 INFO mapreduce.Job:  map 60% reduce 0%
14/02/17 22:24:41 INFO mapreduce.Job:  map 60% reduce 20%
14/02/17 22:24:45 INFO mapreduce.Job:  map 85% reduce 20%
14/02/17 22:24:48 INFO mapreduce.Job:  map 85% reduce 28%
14/02/17 22:24:59 INFO mapreduce.Job:  map 90% reduce 28%
14/02/17 22:25:00 INFO mapreduce.Job:  map 90% reduce 30%
14/02/17 22:25:02 INFO mapreduce.Job:  map 100% reduce 30%
14/02/17 22:25:03 INFO mapreduce.Job:  map 100% reduce 100%
14/02/17 22:25:16 INFO mapreduce.Job:  map 0% reduce 0%
14/02/17 22:25:16 INFO mapreduce.Job: Job job_1392675199090_0001 failed with state FAILED due to: Application application_1392675199090_0001 failed 2 times due to AM Container for appattempt_1392675199090_0001_000002 exited with  exitCode: 1 due to: Exception from container-launch: 
org.apache.hadoop.util.Shell$ExitCodeException: 
    at org.apache.hadoop.util.Shell.runCommand(Shell.java:464)
    at org.apache.hadoop.util.Shell.run(Shell.java:379)
    at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:589)
    at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:195)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:283)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:79)
    at java.util.concurrent.FutureTask.run(FutureTask.java:262)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:744)


.Failing this attempt.. Failing the application.
14/02/17 22:25:16 INFO mapreduce.Job: Counters: 0
java.io.IOException: Job failed!
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:836)
    at org.apache.hadoop.fs.TestDFSIO.runIOTest(TestDFSIO.java:443)
    at org.apache.hadoop.fs.TestDFSIO.writeTest(TestDFSIO.java:425)
    at org.apache.hadoop.fs.TestDFSIO.run(TestDFSIO.java:755)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
    at org.apache.hadoop.fs.TestDFSIO.main(TestDFSIO.java:650)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:72)
    at org.apache.hadoop.util.ProgramDriver.run(ProgramDriver.java:144)
    at org.apache.hadoop.test.MapredTestDriver.run(MapredTestDriver.java:115)
    at org.apache.hadoop.test.MapredTestDriver.main(MapredTestDriver.java:123)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:212)

Looking at the log files, I found this on the datanode machine:

hadoop@datanode:/opt/hadoop-2.2.0/logs$ cat yarn-hadoop-nodemanager-datanode.log
...
2014-02-17 22:29:33,432 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: 0.0.0.0/0.0.0.0:8031. Already tried 7 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)

On my namenode, I ran:

hadoop@namenode:/opt/hadoop-2.2.0/logs$ cat yarn-hadoop-*log
2014-02-17 22:13:20,833 INFO org.apache.hadoop.yarn.server.nodemanager.NodeManager: STARTUP_MSG: 
...
2014-02-17 22:13:25,240 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices: The Auxilurary Service named 'mapreduce_shuffle' in the configuration is for class class org.apache.hadoop.mapred.ShuffleHandler which has a name of 'httpshuffle'. Because these are not the same tools trying to send ServiceData and read Service Meta Data may have issues unless the refer to the name in the config.
...
2014-02-17 22:13:25,505 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: NodeManager configured with 8 G physical memory allocated to containers, which is more than 80% of the total physical memory available (3.6 G). Thrashing might happen.
...
2014-02-17 22:24:48,779 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: Event EventType: KILL_CONTAINER sent to absent container container_1392675199090_0001_01_000023
2014-02-17 22:24:48,779 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: Event EventType: KILL_CONTAINER sent to absent container container_1392675199090_0001_01_000024
...
2014-02-17 22:25:15,733 WARN org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Exit code from container container_1392675199090_0001_02_000001 is : 1
2014-02-17 22:25:15,734 WARN org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Exception from container-launch with container ID: container_1392675199090_0001_02_000001 and exit code: 1
org.apache.hadoop.util.Shell$ExitCodeException: 
    at org.apache.hadoop.util.Shell.runCommand(Shell.java:464)
    at org.apache.hadoop.util.Shell.run(Shell.java:379)
    at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:589)
    at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:195)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:283)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:79)
    at java.util.concurrent.FutureTask.run(FutureTask.java:262)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:744)
...
2014-02-17 22:25:15,736 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch: Container exited with a non-zero exit code 1
...
2014-02-17 22:25:15,751 WARN org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: USER=hadoop   OPERATION=Container Finished - Failed   TARGET=ContainerImpl    RESULT=FAILURE  DESCRIPTION=Container failed with state: EXITED_WITH_FAILURE    APPID=application_1392675199090_0001    CONTAINERID=container_1392675199090_0001_02_000001
...
2014-02-17 22:13:19,150 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: STARTUP_MSG: 
...
2014-02-17 22:25:15,837 WARN org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=hadoop   OPERATION=Application Finished - Failed TARGET=RMAppManager RESULT=FAILURE  DESCRIPTION=App failed with state: FAILED   PERMISSIONS=Application application_1392675199090_0001 failed 2 times due to AM Container for appattempt_1392675199090_0001_000002 exited with  exitCode: 1 due to: Exception from container-launch: 
org.apache.hadoop.util.Shell$ExitCodeException: 
    at org.apache.hadoop.util.Shell.runCommand(Shell.java:464)
    at org.apache.hadoop.util.Shell.run(Shell.java:379)
    at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:589)
    at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:195)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:283)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:79)
    at java.util.concurrent.FutureTask.run(FutureTask.java:262)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:744)


.Failing this attempt.. Failing the application.    APPID=application_1392675199090_0001

However, I checked on the namenode machine that port 8031 is listening. I get:

hadoop@namenode:~$ netstat
Active Internet connections (w/o servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State      
tcp        0      0 namenode.c.forwar:36975 metadata.google.in:http TIME_WAIT  
tcp        0      0 namenode.c.forwar:36969 metadata.google.in:http TIME_WAIT  
tcp        0      0 namenode.c.forwar:40616 namenode.c.forwar:10001 TIME_WAIT  
tcp        0      0 namenode.c.forwar:36974 metadata.google.in:http ESTABLISHED
tcp        0      0 namenode.c.forward:8031 namenode.c.forwar:41229 ESTABLISHED
tcp        0    352 namenode.c.forward-:ssh e178064245.adsl.a:64305 ESTABLISHED
tcp        0      0 namenode.c.forwar:41229 namenode.c.forward:8031 ESTABLISHED
tcp        0      0 namenode.c.forwar:40365 namenode.c.forwar:10001 ESTABLISHED
tcp        0      0 namenode.c.forwar:10001 namenode.c.forwar:40365 ESTABLISHED
tcp        0      0 namenode.c.forwar:10001 datanode:48786          ESTABLISHED
Active UNIX domain sockets (w/o servers)
Proto RefCnt Flags       Type       State         I-Node   Path
unix  10     [ ]         DGRAM                    4604     /dev/log
unix  2      [ ]         STREAM     CONNECTED     10490    
unix  2      [ ]         STREAM     CONNECTED     10488    
unix  2      [ ]         STREAM     CONNECTED     10452    
unix  2      [ ]         STREAM     CONNECTED     8452     
unix  2      [ ]         STREAM     CONNECTED     7800     
unix  2      [ ]         STREAM     CONNECTED     7797     
unix  2      [ ]         STREAM     CONNECTED     6762     
unix  2      [ ]         STREAM     CONNECTED     6702     
unix  2      [ ]         STREAM     CONNECTED     6698     
unix  2      [ ]         STREAM     CONNECTED     6208     
unix  2      [ ]         DGRAM                    5750     
unix  2      [ ]         DGRAM                    5737     
unix  2      [ ]         DGRAM                    5734     
unix  3      [ ]         STREAM     CONNECTED     5643     
unix  3      [ ]         STREAM     CONNECTED     5642     
unix  2      [ ]         DGRAM                    5640     
unix  2      [ ]         DGRAM                    5192     
unix  2      [ ]         DGRAM                    5171     
unix  2      [ ]         DGRAM                    4889     
unix  2      [ ]         DGRAM                    4723     
unix  2      [ ]         DGRAM                    4663     
unix  3      [ ]         DGRAM                    3132     
unix  3      [ ]         DGRAM                    3131     
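
Note that the header "Active Internet connections (w/o servers)" means this netstat output lists established connections only, and the one connection to port 8031 comes from namenode itself. A more direct check is to try opening a TCP connection to the port from the machine that has to reach it. The following is a minimal sketch of such a probe; the function name `can_connect` and the 3-second default timeout are illustrative choices, not part of Hadoop.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout.

    Name-resolution failures, refused connections and timeouts are all
    reported as False, since socket.gaierror and socket.timeout are
    subclasses of OSError.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Run on datanode, `can_connect("namenode", 8031)` would indicate whether the resource tracker port is reachable over the network, while `can_connect("0.0.0.0", 8031)` probes the address that the NodeManager log above shows it retrying.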

So what could be the problem? Everything looks fine to me. Why does my job fail?

too*_*oom 6

The log on datanode shows:

Retrying connect to server: 0.0.0.0/0.0.0.0:8031

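So it is trying to connect to this port on the local machine, datanode. However, the service is running on namenode. Therefore the following configuration lines have to be added to yarn-site.xml: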

  <property>
    <name>yarn.resourcemanager.resource-tracker.address</name>
    <value>namenode:8031</value>
  </property>
  <property>
    <name>yarn.resourcemanager.address</name>
    <value>namenode:8032</value>
  </property>
  <property>
    <name>yarn.resourcemanager.scheduler.address</name>
    <value>namenode:8030</value>
  </property>
  <property>
    <name>yarn.resourcemanager.admin.address</name>
    <value>namenode:8033</value>
  </property>
  <property>
    <name>yarn.resourcemanager.webapp.address</name>
    <value>namenode:8088</value>
  </property>

where namenode is the /etc/hosts alias of the machine that runs the ResourceManager daemon.
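
To confirm on each node that these properties are actually in place, one could scan yarn-site.xml for ResourceManager addresses that are missing or still point at 0.0.0.0. The sketch below is an illustration only: the function name `find_unset_rm_addresses` is hypothetical, the property list is copied from the configuration above, and it treats an absent property as a problem without accounting for other settings that might supply the host.

```python
import xml.etree.ElementTree as ET

# Property names copied from the yarn-site.xml fragment in this answer.
RM_ADDRESS_KEYS = [
    "yarn.resourcemanager.resource-tracker.address",
    "yarn.resourcemanager.address",
    "yarn.resourcemanager.scheduler.address",
    "yarn.resourcemanager.admin.address",
    "yarn.resourcemanager.webapp.address",
]


def find_unset_rm_addresses(xml_text):
    """Return the RM address keys that are absent, empty, or set to 0.0.0.0."""
    root = ET.fromstring(xml_text)

    # Collect every <property><name>/<value> pair found in the file.
    configured = {}
    for prop in root.iter("property"):
        name = prop.findtext("name")
        value = prop.findtext("value")
        if name is not None:
            configured[name.strip()] = (value or "").strip()

    # Flag keys with no usable host part.
    problems = []
    for key in RM_ADDRESS_KEYS:
        value = configured.get(key)
        if not value or value.startswith("0.0.0.0"):
            problems.append(key)
    return problems
```

Passing in the contents of the file, for example read from `/opt/hadoop-2.2.0/etc/hadoop/yarn-site.xml` (an assumed location based on the install path in the question), should return an empty list on both namenode and datanode once the lines above have been added.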