I can't SUM a set of values because of a data type error.
When I load a CSV file whose rows look like this:
6 574 false 10.1.72.23 2010-05-16 13:56:19 +0930 fbcdn.net static.ak.fbcdn.net 304 text/css 1 /rsrc.php/zPTJC/hash/50l7x7eg.css http pwong
using the following:
logs_base = FOREACH raw_logs GENERATE
    FLATTEN(
        EXTRACT(line, '^(\\d+),"(\\d+)","(\\w+)","(\\S+)","(.+?)","(\\S+)","(\\S+)","(\\d+)","(\\S+)","(\\d+)","(\\S+)","(\\S+)","(\\S+)"')
    )
    as (
        account_id: int,
        bytes: long,
        cached: chararray,
        ip: chararray,
        time: chararray,
        domain: chararray,
        host: chararray,
        status: chararray,
        mime_type: chararray,
        page_view: chararray,
        path: chararray,
        protocol: chararray,
        username: chararray
    );
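
To make the extraction step easier to reason about outside Pig, here is a minimal Python sketch that applies the same regular expression to one line. It is an illustration only, not how EXTRACT is implemented. The names `PATTERN`, `FIELD_NAMES`, `SAMPLE_LINE` and `parse_line` are made up for this example, and the sample line is an assumption: it is the row shown above rewritten in the comma-separated, quoted form that the pattern expects. What the sketch shows is that every capture group comes back as text, so a numeric field such as `bytes` still needs an explicit conversion before it can be used in arithmetic.

```python
import re

# Same pattern as in the EXTRACT call, written as a Python raw string
# (Pig's doubled backslashes become single backslashes here).
PATTERN = re.compile(
    r'^(\d+),"(\d+)","(\w+)","(\S+)","(.+?)","(\S+)","(\S+)",'
    r'"(\d+)","(\S+)","(\d+)","(\S+)","(\S+)","(\S+)"'
)

# Field names copied from the "as (...)" clause, in the same order.
FIELD_NAMES = [
    "account_id", "bytes", "cached", "ip", "time", "domain", "host",
    "status", "mime_type", "page_view", "path", "protocol", "username",
]

# Assumption: a quoted, comma-separated rendering of the sample row.
SAMPLE_LINE = (
    '6,"574","false","10.1.72.23","2010-05-16 13:56:19 +0930",'
    '"fbcdn.net","static.ak.fbcdn.net","304","text/css","1",'
    '"/rsrc.php/zPTJC/hash/50l7x7eg.css","http","pwong"'
)


def parse_line(line):
    """Return a dict of field name -> captured text, or None if no match."""
    match = PATTERN.match(line)
    if match is None:
        return None
    return dict(zip(FIELD_NAMES, match.groups()))


record = parse_line(SAMPLE_LINE)

# Every captured value is a string, including the numeric-looking ones.
all_text = all(isinstance(value, str) for value in record.values())

# An explicit conversion is needed before the value can be summed.
bytes_as_number = int(record["bytes"])
```

A line that is not in the quoted, comma-separated form (for example one separated only by whitespace) does not match the pattern, and `parse_line` returns `None` for it.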
All the fields seem to load fine, and with the right types, as the "describe" command shows:
grunt> describe logs_base
logs_base: {account_id: int,bytes: long,cached: chararray,ip: chararray,time: chararray,domain: chararray,host: chararray,status: chararray,mime_type: chararray,page_view: chararray,path: chararray,protocol: chararray,username: chararray}
Whenever I perform a SUM using:
bytesCount = FOREACH (GROUP …