Hadoop使用mongo-hadoop流式传输到python

Con*_*nor 6 python hadoop mongodb cloudera

我正在尝试使用mongo-hadoop获取python的map-reduce功能.Hadoop正在工作,hadoop流媒体正在使用python和mongo-hadoop适配器正在工作.但是,使用python的mongo-hadoop流示例不起作用.当尝试在流/示例/财务中运行示例时,我收到以下错误:

$用户@主机:〜/ GIT中/蒙戈-的hadoop /流$ hadoop的jar目标/蒙戈-Hadoop的流组装-1.0.1.jar -mapper实例/金库/ mapper.py -reducer实例/金库/ reducer.py -inputformat com.mongodb.hadoop.mapred.MongoInputFormat -outputformat com.mongodb.hadoop.mapred.MongoOutputFormat -inputURI mongodb://127.0.0.1/mongo_hadoop.yield_historical.in -outputURI mongodb://127.0.0.1/mongo_hadoop.yield_historical .streaming.out

13/04/09 11:54:34 INFO streaming.MongoStreamJob: Running

13/04/09 11:54:34 INFO streaming.MongoStreamJob: Init

13/04/09 11:54:34 INFO streaming.MongoStreamJob: Process Args

13/04/09 11:54:34 INFO streaming.StreamJobPatch: Setup Options'

13/04/09 11:54:34 INFO streaming.StreamJobPatch: PreProcess Args

13/04/09 11:54:34 INFO streaming.StreamJobPatch: Parse Options

13/04/09 11:54:34 INFO streaming.StreamJobPatch: Arg: '-mapper'

13/04/09 11:54:34 INFO streaming.StreamJobPatch: Arg: 'examples/treasury/mapper.py'

13/04/09 11:54:34 INFO streaming.StreamJobPatch: Arg: '-reducer'

13/04/09 11:54:34 INFO streaming.StreamJobPatch: Arg: 'examples/treasury/reducer.py'

13/04/09 11:54:34 INFO streaming.StreamJobPatch: Arg: '-inputformat'

13/04/09 11:54:34 INFO streaming.StreamJobPatch: Arg: 'com.mongodb.hadoop.mapred.MongoInputFormat'

13/04/09 11:54:34 INFO streaming.StreamJobPatch: Arg: '-outputformat'

13/04/09 11:54:34 INFO streaming.StreamJobPatch: Arg: 'com.mongodb.hadoop.mapred.MongoOutputFormat'

13/04/09 11:54:34 INFO streaming.StreamJobPatch: Arg: '-inputURI'

13/04/09 11:54:34 INFO streaming.StreamJobPatch: Arg: 'mongodb://127.0.0.1/mongo_hadoop.yield_historical.in'

13/04/09 11:54:34 INFO streaming.StreamJobPatch: Arg: '-outputURI'

13/04/09 11:54:34 INFO streaming.StreamJobPatch: Arg: 'mongodb://127.0.0.1/mongo_hadoop.yield_historical.streaming.out'

13/04/09 11:54:34 INFO streaming.StreamJobPatch: Add InputSpecs

13/04/09 11:54:34 INFO streaming.StreamJobPatch: Setup output_

13/04/09 11:54:34 INFO streaming.StreamJobPatch: Post Process Args

13/04/09 11:54:34 INFO streaming.MongoStreamJob: Args processed.

13/04/09 11:54:36 INFO io.MongoIdentifierResolver: Resolving: bson

13/04/09 11:54:36 INFO io.MongoIdentifierResolver: Resolving: bson

13/04/09 11:54:36 INFO io.MongoIdentifierResolver: Resolving: bson

13/04/09 11:54:36 INFO io.MongoIdentifierResolver: Resolving: bson

**Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/mapreduce/filecache/DistributedCache**
    at org.apache.hadoop.streaming.StreamJob.setJobConf(StreamJob.java:959)
    at com.mongodb.hadoop.streaming.MongoStreamJob.run(MongoStreamJob.java:36)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
    at com.mongodb.hadoop.streaming.MongoStreamJob.main(MongoStreamJob.java:63)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:208)
Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.mapreduce.filecache.DistributedCache
    at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
    ... 10 more
Run Code Online (Sandbox Code Playgroud)

如果有人能说出一些亮点,那将是一个很大的帮助.

完整信息:

据我所知,我需要完成以下四项工作:

  1. 安装并测试hadoop
  2. 使用python安装和测试hadoop流
  3. 安装并测试mongo-hadoop
  4. 使用python安装和测试mongo-hadoop流

所以缺点是我已经完成了第四步的所有工作.使用(https://github.com/danielpoe/cloudera)我安装了cloudera 4

  1. 使用厨师配方cloudera 4已经安装并启动并运行和测试
  2. 使用michael nolls博客教程,成功测试了hadoop流与python
  3. 使用mongodb.org上的文档能够运行库和ufo示例(在build.sbt中构建cdh4)
  4. 已经使用流媒体/示例中的twitter示例的自述文件下载了1.5小时的推特数据,并且还尝试了财务示例.

Ros*_*oss 0

您安装了最新的 pymongo_hadoop 连接器吗?您运行的其他软件是什么版本?