min*_*aut 4 apache-pig elastic-map-reduce
I'm unable to SUM a bag of values because of a data type error.
When I load a csv file whose lines look like this:
6 574 false 10.1.72.23 2010-05-16 13:56:19 +0930 fbcdn.net static.ak.fbcdn.net 304 text/css 1 /rsrc.php/zPTJC/hash/50l7x7eg.css http pwong
using the following:
logs_base = FOREACH raw_logs GENERATE
FLATTEN(
EXTRACT(line, '^(\\d+),"(\\d+)","(\\w+)","(\\S+)","(.+?)","(\\S+)","(\\S+)","(\\d+)","(\\S+)","(\\d+)","(\\S+)","(\\S+)","(\\S+)"')
)
as (
account_id: int,
bytes: long,
cached: chararray,
ip: chararray,
time: chararray,
domain: chararray,
host: chararray,
status: chararray,
mime_type: chararray,
page_view: chararray,
path: chararray,
protocol: chararray,
username: chararray
);
all the fields seem to load fine and with the right types, as the "describe" command shows:
grunt> describe logs_base
logs_base: {account_id: int,bytes: long,cached: chararray,ip: chararray,time: chararray,domain: chararray,host: chararray,status: chararray,mime_type: chararray,page_view: chararray,path: chararray,protocol: chararray,username: chararray}
Whenever I perform a SUM using:
bytesCount = FOREACH (GROUP logs_base ALL) GENERATE SUM(logs_base.bytes);
and then store or dump the contents, the mapreduce process fails with the following error:
org.apache.pig.backend.executionengine.ExecException: ERROR 2106: Error while computing sum in Initial
at org.apache.pig.builtin.LongSum$Initial.exec(LongSum.java:87)
at org.apache.pig.builtin.LongSum$Initial.exec(LongSum.java:65)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:216)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:253)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.getNext(PhysicalOperator.java:334)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:332)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:284)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:290)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLocalRearrange.getNext(POLocalRearrange.java:256)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.runPipeline(PigGenericMapBase.java:267)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:262)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:64)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:771)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:375)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
Caused by: java.lang.ClassCastException: java.lang.String cannot be cast to java.lang.Long
at org.apache.pig.builtin.LongSum$Initial.exec(LongSum.java:79)
... 15 more
The line that catches my attention is:
Caused by: java.lang.ClassCastException: java.lang.String cannot be cast to java.lang.Long
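This leads me to believe that the extract function is not converting the bytes field to the required data type (long).
Is there a way to force the extract function to convert to the proper data types? How can I cast the value without having to do a FOREACH over all the records? (The same problem happens when I convert the time to a unix timestamp and try to find the MIN. I would definitely like a solution that doesn't require unnecessary projections.)
Any pointers will be appreciated. Thanks a lot for your help.
Regards, Jorge C.
P.S. I'm running this in interactive mode on the Amazon Elastic MapReduce service.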
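Have you tried casting the data retrieved from the UDF? Applying the schema here does not perform any conversion; it only labels the fields, so the values stay whatever type the UDF returned.
For example: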
logs_base =
FOREACH raw_logs
GENERATE
FLATTEN(
(tuple(LONG,LONG,CHARARRAY,....)) EXTRACT(line, '^...')
)
AS (account_id: INT, ...);
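To see why a declared schema alone is not enough, here is a minimal Java sketch of what I assume is happening inside the aggregate: the values coming out of EXTRACT are still strings, and a blind cast to Long fails exactly as in the stack trace, while an explicit conversion works. The class and method names (CastVsParse, sumByCasting, sumByParsing) are made up for illustration and are not part of Pig.

```java
import java.util.Arrays;
import java.util.List;

class CastVsParse {
    // Hypothetical stand-in for an aggregate that trusts the declared
    // type and casts each value without converting it.
    static long sumByCasting(List<Object> values) {
        long total = 0;
        for (Object v : values) {
            total += (Long) v; // ClassCastException if v is really a String
        }
        return total;
    }

    // Same sum, but each value is explicitly converted from its text form.
    static long sumByParsing(List<Object> values) {
        long total = 0;
        for (Object v : values) {
            total += Long.parseLong(v.toString());
        }
        return total;
    }

    public static void main(String[] args) {
        // Values as a regex-based extractor would hand them over: strings.
        List<Object> extracted = Arrays.asList("574", "100");
        System.out.println(sumByParsing(extracted)); // prints 674
        try {
            sumByCasting(extracted);
        } catch (ClassCastException e) {
            System.out.println("cast failed: " + e.getMessage());
        }
    }
}
```

The explicit cast in the Pig snippet above plays the role of sumByParsing: it asks for an actual conversion instead of just renaming the field's type.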