如何构建/运行这个简单的Mahout程序而不会出现异常?

dra*_*nxo 8 java hadoop mahout

我想运行我在Mahout In Action中找到的代码:

package org.help;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.mahout.math.DenseVector;
import org.apache.mahout.math.NamedVector;
import org.apache.mahout.math.VectorWritable;

public class SeqPrep {

    public static void main(String args[]) throws IOException{

        List<NamedVector> apples = new ArrayList<NamedVector>();

        NamedVector apple;

        apple = new NamedVector(new DenseVector(new double[]{0.11, 510, 1}), "small round green apple");        

        apples.add(apple);

        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path path = new Path("appledata/apples");

        SequenceFile.Writer writer = new SequenceFile.Writer(fs,  conf, path, Text.class, VectorWritable.class);

        VectorWritable vec = new VectorWritable();
        for(NamedVector vector : apples){
            vec.set(vector);
            writer.append(new Text(vector.getName()), vec);
        }
        writer.close();

        SequenceFile.Reader reader = new SequenceFile.Reader(fs, new Path("appledata/apples"), conf);

        Text key = new Text();
        VectorWritable value = new VectorWritable();
        while(reader.next(key, value)){
            System.out.println(key.toString() + " , " + value.get().asFormatString());
        }
        reader.close();

    }

}
Run Code Online (Sandbox Code Playgroud)

我编译它:

$ javac -classpath :/usr/local/hadoop-1.0.3/hadoop-core-1.0.3.jar:/home/hduser/mahout/trunk/core/target/mahout-core-0.8-SNAPSHOT.jar:/home/hduser/mahout/trunk/core/target/mahout-core-0.8-SNAPSHOT-job.jar:/home/hduser/mahout/trunk/core/target/mahout-core-0.8-SNAPSHOT-sources.jar -d myjavac/ SeqPrep.java
Run Code Online (Sandbox Code Playgroud)

我把它:

$ jar -cvf SeqPrep.jar -C myjavac/ .
Run Code Online (Sandbox Code Playgroud)

现在我想在我的本地hadoop节点上运行它.我试过了:

 hadoop jar SeqPrep.jar org.help.SeqPrep
Run Code Online (Sandbox Code Playgroud)

但我得到:

Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/mahout/math/Vector
    at java.lang.Class.forName0(Native Method)
    at java.lang.Class.forName(Class.java:247)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:149)
Run Code Online (Sandbox Code Playgroud)

所以我尝试使用libjars参数:

$ hadoop jar SeqPrep.jar org.help.SeqPrep -libjars /home/hduser/mahout/trunk/core/target/mahout-core-0.8-SNAPSHOT.jar -libjars /home/hduser/mahout/trunk/core/target/mahout-core-0.8-SNAPSHOT-job.jar -libjars /home/hduser/mahout/trunk/core/target/mahout-core-0.8-SNAPSHOT-sources.jar -libjars /home/hduser/mahout/trunk/math/target/mahout-math-0.8-SNAPSHOT.jar -libjars /home/hduser/mahout/trunk/math/target/mahout-math-0.8-SNAPSHOT-sources.jar
Run Code Online (Sandbox Code Playgroud)

并得到了同样的问题.我不知道还有什么可以尝试的.

我最终的目标是能够将hadoop fs上的.csv文件读入稀疏矩阵,然后将其乘以随机向量.

编辑:看起来像Razvan得到它(注意:请参阅下面的另一种方法,这样做不会弄乱你的hadoop安装).以供参考:

$ find /usr/local/hadoop-1.0.3/. |grep mah
/usr/local/hadoop-1.0.3/./lib/mahout-core-0.8-SNAPSHOT-tests.jar
/usr/local/hadoop-1.0.3/./lib/mahout-core-0.8-SNAPSHOT.jar
/usr/local/hadoop-1.0.3/./lib/mahout-core-0.8-SNAPSHOT-job.jar
/usr/local/hadoop-1.0.3/./lib/mahout-core-0.8-SNAPSHOT-sources.jar
/usr/local/hadoop-1.0.3/./lib/mahout-math-0.8-SNAPSHOT-sources.jar
/usr/local/hadoop-1.0.3/./lib/mahout-math-0.8-SNAPSHOT-tests.jar
/usr/local/hadoop-1.0.3/./lib/mahout-math-0.8-SNAPSHOT.jar
Run Code Online (Sandbox Code Playgroud)

然后:

$hadoop jar SeqPrep.jar org.help.SeqPrep

small round green apple , small round green apple:{0:0.11,1:510.0,2:1.0}
Run Code Online (Sandbox Code Playgroud)

编辑: 我试图这样做而不将mahout罐子复制到hadoop lib /

$ rm /usr/local/hadoop-1.0.3/lib/mahout-*
Run Code Online (Sandbox Code Playgroud)

然后当然:

hadoop jar SeqPrep.jar org.help.SeqPrep

Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/mahout/math/Vector
    at java.lang.Class.forName0(Native Method)
    at java.lang.Class.forName(Class.java:247)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:149)
Caused by: java.lang.ClassNotFoundException: org.apache.mahout.math.Vector
    at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
Run Code Online (Sandbox Code Playgroud)

当我尝试mahout工作文件时:

$hadoop jar ~/mahout/trunk/core/target/mahout-core-0.8-SNAPSHOT-job.jar org.help.SeqPrep

Exception in thread "main" java.lang.ClassNotFoundException: org.help.SeqPrep
    at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
    at java.lang.Class.forName0(Native Method)
    at java.lang.Class.forName(Class.java:247)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:149)
Run Code Online (Sandbox Code Playgroud)

如果我尝试包含我制作的.jar文件:

$ hadoop jar ~/mahout/trunk/core/target/mahout-core-0.8-SNAPSHOT-job.jar SeqPrep.jar org.help.SeqPrep

Exception in thread "main" java.lang.ClassNotFoundException: SeqPrep.jar
Run Code Online (Sandbox Code Playgroud)

编辑:显然我一次只能发送一个罐子到hadoop.这意味着我需要将我制作的类添加到mahout核心作业文件中:

~/mahout/trunk/core/target$ cp mahout-core-0.8-SNAPSHOT-job.jar mahout-core-0.8-SNAPSHOT-job.jar_backup

~/mahout/trunk/core/target$ cp ~/workspace/seqprep/bin/org/help/SeqPrep.class .

~/mahout/trunk/core/target$ jar uf mahout-core-0.8-SNAPSHOT-job.jar SeqPrep.class
Run Code Online (Sandbox Code Playgroud)

然后:

~/mahout/trunk/core/target$ hadoop jar mahout-core-0.8-SNAPSHOT-job.jar org.help.SeqPrep

Exception in thread "main" java.lang.ClassNotFoundException: org.help.SeqPrep
Run Code Online (Sandbox Code Playgroud)

编辑:好的,现在我可以在不搞乱我的hadoop安装的情况下完成.我在之前的编辑中更新了.jar错误.它应该是:

~/mahout/trunk/core/target$ jar uf mahout-core-0.8-SNAPSHOT-job.jar org/help/SeqPrep.class
Run Code Online (Sandbox Code Playgroud)

然后:

~/mahout/trunk/core/target$ hadoop jar mahout-core-0.8-SNAPSHOT-job.jar org.help.SeqPrep

small round green apple , small round green apple:{0:0.11,1:510.0,2:1.0}
Run Code Online (Sandbox Code Playgroud)

Sea*_*wen 11

您需要使用Mahout提供的"job"JAR文件.它打包了所有依赖项.您还需要将类添加到其中.这就是所有Mahout示例的工作方式.你不应该把Mahout jar放在Hadoop lib中,因为那种在Hadoop中"安装"程序太深了.


Ale*_*Ott 7

如果您将从https://github.com/tdunning/MiA存储库获取示例代码,那么它包含pom.xmlMaven的即用型文件.当您使用编译代码时mvn package,它将mia-0.1-job.jartarget目录中创建- 此存档包含除Hadoop之外的所有依赖项,因此您可以在Hadoop集群上运行它而不会出现问题