1fr*_*ggy 21 syntax hadoop map apache-pig
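I'd like to use Apache Pig to build a large key -> value mapping, look things up in the map, and iterate over the keys. However, there does not even seem to be syntax for doing these things; I've checked the manual, the wiki, sample code, the Elephant book, Google, and even tried parsing the parser source. Every single example loads map literals from a file... and then never uses them. How do you use Pig's maps?
First, there doesn't seem to be a way to load a 2-column CSV file directly into a map. If I have a simple map.csv: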
1,2
3,4
5,6
I try to load it as a map:
m = load 'map.csv' using PigStorage(',') as (M: []);
dump m;
I get three empty tuples:
()
()
()
So I try to load tuples and then generate the map:
m = load 'map.csv' using PigStorage(',') as (key:chararray, val:chararray);
b = foreach m generate [key#val];
ERROR 1000: Error during parsing. Encountered " "[" "[ "" at line 1, column 24.
...
Many variations on the syntax also fail (e.g., generate [$0#$1]).
OK, so I munge my map into Pig's map literal format as map.pig:
[1#2]
[3#4]
[5#6]
And load it:
m = load 'map.pig' as (M: []);
Now let's load some keys and try lookups:
k = load 'keys.csv' as (key);
dump k;
3
5
1
c = foreach k generate m#key; /* Or m[key], or... what? */
ERROR 1000: Error during parsing. Invalid alias: m in {M: map[ ]}
Hmm, OK, maybe since there are two relations involved, we need a join:
c = join k by key, m by /* ...um, what? */ $0;
dump c;
ERROR 1068: Using Map as key not supported.
c = join k by key, m by m#key;
dump c;
Error 1000: Error during parsing. Invalid alias: m in {M: map[ ]}
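Fail. How do I refer to the key (or value) of a map? The map schema syntax doesn't seem to let you even name the key and value (the mailing list says there's no way to assign types).
Finally, I'd just like to be able to find all the keys in my map: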
d = foreach m generate ...oh, forget it.
Is Pig's map type half-baked? What am I missing?
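I think you need to think in terms of relations, where a map is just one field of one record. You can then apply operations to the relations, such as joining the two data sets, data and mapping:
Input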
$ cat data.txt
1
2
3
4
5
$ cat mapping.txt
1 2
2 4
3 6
4 8
5 10
Pig
mapping = LOAD 'mapping.txt' AS (key:CHARARRAY, value:CHARARRAY);
data = LOAD 'data.txt' AS (value:CHARARRAY);
-- list keys
mapping_keys =
    FOREACH mapping
    GENERATE key;
DUMP mapping_keys;
-- join mapping to data
mapped_data =
    JOIN mapping BY key, data BY value;
DUMP mapped_data;
Output
> # keys
(1)
(2)
(3)
(4)
(5)
> # mapped data
(1,2,1)
(2,4,2)
(3,6,3)
(4,8,4)
(5,10,5)
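The joined tuples carry the join key twice, once from each relation. A minimal sketch of trimming them down to plain (key, value) pairs follows. It reuses the mapping.txt and data.txt files from the input above; the relation name looked_up is only an illustrative choice. Because both inputs have a field called value, the fields have to be disambiguated with the relation::field form after the join.

```pig
-- Same loads and join as in the script above.
mapping = LOAD 'mapping.txt' AS (key:CHARARRAY, value:CHARARRAY);
data = LOAD 'data.txt' AS (value:CHARARRAY);
mapped_data = JOIN mapping BY key, data BY value;

-- Keep one copy of the key plus the looked-up value.
-- 'value' alone would be ambiguous here, hence the relation:: prefixes.
looked_up =
    FOREACH mapped_data
    GENERATE data::value AS key, mapping::value AS value;
DUMP looked_up;
```

With the sample input, the tuple (3,6,3) from the join would become (3,6).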
If you just want to do a simple lookup, this answer may also help: pass-a-relation-to-a-pig-udf-when-using-foreach-on-another-relation
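For completeness, here is a rough sketch of working with the map type itself rather than with two-column relations. It assumes a Pig release that ships the built-in TOMAP and KEYSET functions; availability depends on the version, so check the built-in function reference for yours. It reuses the map.csv file from the question, and the relation names are illustrative. As far as I know, the # dereference only accepts a constant quoted key, which is why a lookup driven by keys from another relation still goes through a JOIN.

```pig
-- Load the CSV as two plain fields; the question shows that loading
-- "1,2" directly as a map only yields empty tuples.
pairs = LOAD 'map.csv' USING PigStorage(',') AS (key:chararray, val:chararray);

-- Build a single-entry map per row (assumes the TOMAP built-in is available).
-- TOMAP takes alternating key and value expressions.
maps = FOREACH pairs GENERATE TOMAP(key, val) AS M;

-- Look up a constant key; rows whose map lacks that key yield null.
looked_up = FOREACH maps GENERATE M#'3' AS val;
found = FILTER looked_up BY val IS NOT NULL;
DUMP found;

-- List every key (assumes the KEYSET built-in is available).
-- KEYSET returns a bag of keys; FLATTEN turns the bag into rows.
all_keys = FOREACH maps GENERATE FLATTEN(KEYSET(M)) AS key;
DUMP all_keys;
```

Note that each row here holds its own one-entry map, so this is a per-record structure and not a single global lookup table, which matches the point above that a map is just one field of one record.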