Running MRJob from an IPython notebook

szu*_*szu 5 python mapreduce mrjob ipython-notebook

I am trying to run the mrjob example from an IPython notebook:

from mrjob.job import MRJob


class MRWordFrequencyCount(MRJob):

    def mapper(self, _, line):
        yield "chars", len(line)
        yield "words", len(line.split())
        yield "lines", 1

    def reducer(self, key, values):
        yield key, sum(values)

and then run it with this code:

mr_job = MRWordFrequencyCount(args=["testfile.txt"])
with mr_job.make_runner() as runner:
    runner.run()
    for line in runner.stream_output():
        key, value = mr_job.parse_output_line(line)
        print key, value

and I get this error:

TypeError: <module '__main__' (built-in)> is a built-in class

Is there a way to run mrjob from an IPython notebook?

sap*_*ico 1

I suspect this is due to a limitation stated on the MRJob website:

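The file with the job class is sent to Hadoop to be run. Therefore, the job file cannot attempt to start the Hadoop job, or you would be recursively creating Hadoop jobs! The code that runs the job should only run outside of the Hadoop context.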
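A note on where the error text comes from: `... is a built-in class` is the message Python's `inspect` module raises when asked for the source file of a class whose module has no `__file__`, which is the situation for a class typed into a notebook cell. The sketch below reproduces that with the standard library only. It assumes, without confirming it from the mrjob source, that mrjob looks up the job's file in a similar way so that it can ship the file to Hadoop; `notebook_session`, `Job` and `locate_source` are made-up names for the demonstration.

```python
import inspect
import types

# Stand-in for a notebook's __main__: a module object with no __file__.
# "notebook_session" is a made-up name used only for this demonstration.
session = types.ModuleType("notebook_session")
exec("class Job(object):\n    pass\n", session.__dict__)


def locate_source(cls):
    """Return the file a class was defined in, or the error text if there is none."""
    try:
        return inspect.getsourcefile(cls)
    except (TypeError, OSError) as exc:
        return "no source file: %s" % exc


print(locate_source(session.Job))        # class created interactively
print(locate_source(inspect.Signature))  # class imported from a real .py file
```

The first call reports that no source file is available, while the second returns a path, because `inspect.Signature` lives in a real `.py` file.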

Alternatively, it may be because you are missing the following (reference):

if __name__ == '__main__':
    MRWordCounter.run()  # where MRWordCounter is your job class
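Putting the two points above together, one possible workaround is to save the job class in its own file (with the `__main__` guard) and keep only the driver code in the notebook. This is a sketch, not something tested against a Hadoop cluster: the module name `mr_word_frequency_count` and the helpers `write_job_file` and `run_job` are hypothetical, and `run_job` assumes a recent mrjob release (0.6 or later) that provides `parse_output` and `cat_output`. On older releases, the `stream_output` / `parse_output_line` calls from the question would take their place.

```python
import importlib
import sys
from pathlib import Path

# Hypothetical module name; any importable file name would do.
JOB_MODULE = "mr_word_frequency_count"

# The job class from the question, plus the __main__ guard from the answer.
JOB_SOURCE = '''\
from mrjob.job import MRJob


class MRWordFrequencyCount(MRJob):

    def mapper(self, _, line):
        yield "chars", len(line)
        yield "words", len(line.split())
        yield "lines", 1

    def reducer(self, key, values):
        yield key, sum(values)


if __name__ == '__main__':
    MRWordFrequencyCount.run()
'''


def write_job_file(directory="."):
    """Save the job class to its own .py file so it has a real source file."""
    path = Path(directory) / (JOB_MODULE + ".py")
    path.write_text(JOB_SOURCE)
    return path


def run_job(input_path, directory="."):
    """Import the job from its file and run it (requires mrjob to be installed)."""
    directory = str(Path(directory).resolve())
    if directory not in sys.path:
        sys.path.insert(0, directory)
    module = importlib.import_module(JOB_MODULE)
    mr_job = module.MRWordFrequencyCount(args=[input_path])
    results = {}
    with mr_job.make_runner() as runner:
        runner.run()
        # Assumed mrjob 0.6+ spelling; older releases use runner.stream_output()
        # with mr_job.parse_output_line(line), as in the question.
        for key, value in mr_job.parse_output(runner.cat_output()):
            results[key] = value
    return results
```

In a notebook cell, `write_job_file()` followed by `run_job("testfile.txt")` would then return a dict with the three counts. IPython's `%%writefile` cell magic is another way to create the job file.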