Hadoop Streaming job written in Python fails

Yuh*_*ang 5 python hadoop mapreduce

I have a MapReduce job written in Python. The program was tested successfully in a local Linux environment, but it fails when run under Hadoop.

This is the job command:

hadoop  jar $HADOOP_HOME/contrib/streaming/hadoop-0.20.1+169.127-streaming.jar \
   -input /data/omni/20110115/exp6-10122 -output /home/yan/visitorpy.out \
   -mapper SessionMap.py   -reducer  SessionRed.py  -file SessionMap.py \
   -file  SessionRed.py

The Session*.py files have mode 755, and `#!/usr/bin/env python` is the first line of each *.py file. The mapper is:

#!/usr/bin/env python
import sys

for line in sys.stdin:
    val = line.split("\t")
    (visidH, visidL, sessionID) = (val[4], val[5], val[108])
    print "%s%s\t%s" % (visidH, visidL, sessionID)
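A `Broken pipe` from `PipeMapper` usually means the Python process exited early, and a common cause is an uncaught exception such as an `IndexError` when a line has fewer than 109 tab-separated fields. A minimal defensive sketch of the mapper above (field indices are taken from the original code; `map_line` is a hypothetical helper name introduced here for testability):

```python
#!/usr/bin/env python
import sys

def map_line(line):
    # Guard each record: skip rows with fewer than 109 tab-separated
    # fields instead of crashing the whole task with an IndexError.
    val = line.rstrip("\n").split("\t")
    if len(val) < 109:
        return None
    return "%s%s\t%s" % (val[4], val[5], val[108])

def main(stdin=sys.stdin, stdout=sys.stdout):
    for line in stdin:
        out = map_line(line)
        if out is not None:
            stdout.write(out + "\n")

if __name__ == "__main__":
    main()
```

If the mapper must not silently drop bad rows, an alternative is to write a diagnostic to `sys.stderr`, which shows up in the task's stderr logs rather than in the job output.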

The error from the logs:

java.io.IOException: Broken pipe
    at java.io.FileOutputStream.writeBytes(Native Method)
    at java.io.FileOutputStream.write(FileOutputStream.java:260)
    at java.io.BufferedOutputStream.write(BufferedOutputStream.java:105)
    at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
    at java.io.BufferedOutputStream.write(BufferedOutputStream.java:109)
    at java.io.DataOutputStream.write(DataOutputStream.java:90)
    at org.apache.hadoop.streaming.io.TextInputWriter.writeUTF8(TextInputWriter.java:72)
    at org.apache.hadoop.streaming.io.TextInputWriter.writeValue(TextInputWriter.java:51)
    at org.apache.hadoop.streaming.PipeMapper.map(PipeMapper.java:110)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
    at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:36)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
    at org.apache.hadoop.mapred.Child.main(Child.java:170)
    at org.apache.hadoop.streaming.PipeMapper.map(PipeMapper.java:126)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
    at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:36)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
    at org.apache.hadoop.mapred.Child.main(Child.java:170)

Aka*_*wal 5

I ran into the same problem and was puzzled, because my mapper and reducer ran fine when I tested them against sample data locally. But when I ran the same test set through a Hadoop map-reduce job, I kept hitting the same error.

How to test your code locally:

cat <test file> | python mapper.py | sort | python reducer.py
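The shell pipeline above can also be modeled in-process, which makes it easy to unit-test mapper and reducer logic before submitting a job. A minimal sketch: run a mapper over every input line, sort (Hadoop's shuffle/sort phase), then reduce. `word_mapper` and `count_reducer` are hypothetical stand-ins for SessionMap.py / SessionRed.py:

```python
from itertools import groupby

def run_streaming(lines, mapper, reducer):
    # Map every line, then sort -- this mimics Hadoop's shuffle/sort,
    # which guarantees the reducer sees equal keys on adjacent lines.
    mapped = sorted(m for line in lines for m in mapper(line))
    return list(reducer(mapped))

def word_mapper(line):
    # Emit one "key<TAB>1" record per word, streaming-style.
    for word in line.split():
        yield "%s\t1" % word

def count_reducer(lines):
    # Input arrives sorted, so records with the same key are adjacent.
    for key, group in groupby(lines, key=lambda l: l.split("\t")[0]):
        yield "%s\t%d" % (key, sum(1 for _ in group))
```

For example, `run_streaming(["a b a"], word_mapper, count_reducer)` returns `["a\t2", "b\t1"]`.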

On further investigation, I found that I had not included the 'shebang line' in my mapper.py script.

#!/usr/bin/python

Add the line above as the first line of your Python script, and leave a blank line after it.

If you want to know more about the 'shebang line', read Why do people write #!/usr/bin/env python on the first line of a Python script?


hym*_*oth 0

Python + Hadoop is tricky in some details that it shouldn't be. Take a look here.

Try wrapping the input path in double quotes: `-input "/data/omni/20110115/exp6-10122"`.