NYC*_*yes 2 hadoop mapreduce exception apache-pig
Here is my (seemingly trivial) Pig script, followed by the exception it generates:
raw_logs = LOAD './Apache-WebLog-Samples.d/access_log.txt' USING TextLoader() AS (line:chararray);
logs = FOREACH raw_logs GENERATE FLATTEN (
         REGEX_EXTRACT_ALL(line, '^(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+\\[([\\w:/]+\\s[+\\-]\\d{4})\\]\\s+"(..*)"\\s+(\\S+)\\s+(\\S+)'))
       AS (remoteAddr: chararray,
           remoteLogname: chararray,
           user: chararray,
           date_time: chararray,
           request: chararray,
           httpStatus: int,  -- Here's the problem; it goes away when I set this to chararray.
           numBytes: int);
httpGET200 = FILTER logs BY (request MATCHES '^GET\\s.*') AND (httpStatus == 200);
mylimit = LIMIT httpGET200 40;
DUMP mylimit;
That is the Pig script. Running it produces the following exception:
java.lang.Exception: java.lang.ClassCastException: java.lang.Integer cannot be cast to java.lang.String
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:404)
Caused by: java.lang.ClassCastException: java.lang.Integer cannot be cast to java.lang.String
[ ... non meaningful error output removed ... ]
2013-03-13 14:04:10,882 [main] INFO org.apache.pig.tools.pigstats.SimplePigStats - Script Statistics:
HadoopVersion PigVersion UserId StartedAt FinishedAt Features
2.0.0-cdh4.2.0 0.10.0-cdh4.2.0 nmvega 2013-03-13 14:04:05 2013-03-13 14:04:10 FILTER,LIMIT
Failed!
Failed Jobs:
JobId Alias Feature Message Outputs
job_local1982169921_0001 httpGET200,logs,mylimit,raw_logs Message: Job failed!
Input(s):
Failed to read data from "file:///home/user/Dropbox/CodeDEV.d/BIG-DATA-SNIPPETS.d/PIG.d/Apache-WebLog-Samples.d/access_log.txt"
Output(s):
That is the exception output.
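Everything works except the 'httpGET200' relation. For reasons I don't understand, the clause "httpStatus == 200" causes the exception above. When I remove that clause, the problem goes away. Alternatively, when I change the schema and declare 'httpStatus' as "chararray" instead of "int" (the type shown above, and the appropriate one for HTTP status codes), the problem also goes away. Of course, when I do that I have to edit the relation to add quotes, like this: httpStatus == '200'.
I checked the input data file and verified that, on every line, the field corresponding to 'httpStatus' is indeed always an integer (well, a substring representing an integer).
By the way, this is the schema as grunt reports it (i.e., as expected):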
grunt> describe httpGET200;
httpGET200: {remoteAddr: chararray,remoteLogname: chararray,user: chararray,date_time: chararray,request: chararray,httpStatus: int,numBytes: int}
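I would like to understand what is going on here (a misunderstanding on my part, or a Pig limitation). Can anyone shed some light?
Thanks!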
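It seems to me that, with REGEX_EXTRACT_ALL, declaring a field as int in the output schema leads to a ClassCastException later, when an arithmetic or comparison operation is performed on that field. This is probably because all fields in the returned tuple remain chararray, and are treated as such, despite the given schema.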
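To illustrate the suspected mechanism outside Pig, here is a minimal Java sketch. It is not Pig's actual code: the class `SchemaLabelDemo` and its methods are hypothetical, and the sketch only assumes that the tuple holds String values while the comparison trusts the declared int type without converting anything.

```java
import java.util.Arrays;
import java.util.List;

class SchemaLabelDemo {
    // Hypothetical stand-in for the tuple a regex UDF returns:
    // every captured group is a String, whatever the schema says.
    static List<Object> extractAll() {
        return Arrays.asList("127.0.0.1", "GET /index.html HTTP/1.1", "200");
    }

    // Trusts the declared type: compares the raw field with an Integer
    // without converting it. With a String field, String.compareTo
    // tries to cast the Integer argument to String and fails.
    @SuppressWarnings({"unchecked", "rawtypes"})
    static boolean equalsTrustingSchema(Object field, int constant) {
        return ((Comparable) field).compareTo(Integer.valueOf(constant)) == 0;
    }

    // Counterpart of an explicit (int) cast: convert first, then compare.
    static boolean equalsWithExplicitCast(Object field, int constant) {
        return Integer.parseInt((String) field) == constant;
    }

    public static void main(String[] args) {
        Object httpStatus = extractAll().get(2);
        try {
            equalsTrustingSchema(httpStatus, 200);
        } catch (ClassCastException e) {
            // Same exception class as in the Pig job output.
            System.out.println("ClassCastException: " + e.getMessage());
        }
        System.out.println(equalsWithExplicitCast(httpStatus, 200));
    }
}
```

The first comparison throws because the value is still a String; the second succeeds because the value is converted before it is compared, which is what an explicit cast in the script would do.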
As a workaround, you can declare all fields as chararray and then perform an explicit cast:
logs = FOREACH raw_logs ....
conv = FOREACH logs generate remoteAddr, remoteLogname, user, date_time,
request, (int)httpStatus, (int)numBytes;
Then you can apply the filter you originally used:
httpGET200 = FILTER conv BY (request MATCHES '^GET\\s.*') AND (httpStatus == 200);
You can find more information about a similar issue in this ticket: