How do I load files on a Hadoop cluster using Apache Pig?

Question

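I have a Pig script that needs to load files from a local Hadoop cluster. I can list the files with the hadoop command: hadoop fs -ls /repo/mydata, but when I try to load the files in the Pig script, it fails. The load statement is: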

in = LOAD '/repo/mydata/2012/02' USING PigStorage() AS (event:chararray, user:chararray);

The error message is:

Message: org.apache.pig.backend.executionengine.ExecException: ERROR 2118: Input path does not exist: file:/repo/mydata/2012/02

Any ideas? Thanks.

Answer 1

Cha*_*tra 5

My suggestion:

Create a folder in HDFS: hadoop fs -mkdir /pigdata
Copy the file into the new HDFS folder: hadoop fs -put /opt/pig/tutorial/data/excite-small.log /pigdata

(Or you can do this from the Grunt shell: grunt> copyFromLocal /opt/pig/tutorial/data/excite-small.log /pigdata)

Run the Pig Latin script:

   grunt> set debug on

   grunt> set job.name 'first-p2-job'

   grunt> log = LOAD 'hdfs://hostname:54310/pigdata/excite-small.log' AS 
              (user:chararray, time:long, query:chararray); 
   grunt> grpd = GROUP log BY user; 
   grunt> cntd = FOREACH grpd GENERATE group, COUNT(log); 
   grunt> STORE cntd INTO 'output';

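The output files are stored in HDFS. Because 'output' is a relative path, Pig resolves it against the current working directory, which by default is the user's HDFS home directory (for example hdfs://hostname:54310/user/<username>/output). To write to hdfs://hostname:54310/pigdata/output instead, give that full path in the STORE statement.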
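
The file:/ prefix in the error from the question suggests that Pig resolved the path against the local filesystem, which typically happens when Pig runs in local mode or cannot find the cluster configuration. A minimal sketch of the same idea applied to the original load statement is shown here. It assumes the NameNode address hostname:54310 from the example above (replace it with your cluster's value); the file name load_events.pig and the aliases events and first_rows are hypothetical names chosen for illustration.

```shell
#!/bin/sh
# Assumption: NameNode address copied from the answer's example; use your cluster's own.
NAMENODE="hdfs://hostname:54310"
# Hypothetical file name for the generated Pig script.
SCRIPT="load_events.pig"

# Write a Pig script whose input path is a fully qualified HDFS URI,
# so the path is not resolved against the local filesystem (file:/).
cat > "$SCRIPT" <<EOF
events = LOAD '${NAMENODE}/repo/mydata/2012/02' USING PigStorage()
         AS (event:chararray, user:chararray);
-- Print only a few rows to confirm that the input is readable.
first_rows = LIMIT events 10;
DUMP first_rows;
EOF

# Run in MapReduce mode when pig is installed; "-x local" would read local files instead.
if command -v pig >/dev/null 2>&1; then
    pig -x mapreduce "$SCRIPT"
else
    echo "pig not found; generated $SCRIPT only"
fi
```

If the script runs in MapReduce mode with the cluster configuration on Pig's classpath, the plain path /repo/mydata/2012/02 should also resolve to HDFS; spelling out the URI simply removes the dependency on that configuration.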
