How can I use the map datatype in Apache Pig?

1fr*_*ggy 21 syntax hadoop map apache-pig

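I'd like to use Apache Pig to build a large key -> value mapping, look things up in the map, and iterate over the keys. However, there doesn't even seem to be syntax for doing these things; I've checked the manual, the wiki, sample code, the Elephant book, Google, and even tried reading the parser source. Every example loads map literals from a file... and then never uses them. How do you actually use Pig's maps?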

First, there doesn't seem to be a way to load a two-column CSV file directly into a map. If I have a simple map.csv:

1,2
3,4
5,6

And I try to load it as a map:

m = load 'map.csv' using PigStorage(',') as (M: []);
dump m;

I get three empty tuples:

()
()
()

So I try loading tuples and then generating the map:

m = load 'map.csv' using PigStorage(',') as (key:chararray, val:chararray);
b = foreach m generate [key#val];
ERROR 1000: Error during parsing. Encountered " "[" "[ "" at line 1, column 24.
...

Many variations on the syntax also fail (e.g., generate [$0#$1]).

OK, so I convert my map into Pig's map literal format as map.pig:

[1#2]
[3#4]
[5#6]

And load it:

m = load 'map.pig' as (M: []);

Now let's load some keys and try a lookup:

k = load 'keys.csv' as (key);
dump k;
3
5
1

c = foreach k generate m#key;  /* Or m[key], or... what? */
ERROR 1000: Error during parsing.  Invalid alias: m in {M: map[ ]}

Hmm, OK, maybe since there are two relations involved, we need a join:

c = join k by key, m by /* ...um, what? */ $0;
dump c;
ERROR 1068: Using Map as key not supported.
c = join k by key, m by m#key;
dump c;
Error 1000: Error during parsing. Invalid alias: m in {M: map[ ]}

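Fail. How do I refer to the key (or value) of a map? The map schema syntax doesn't seem to let you name the key and value (the mailing list says there is no way to assign types).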

Finally, I'd just like to find all the keys in my map:

d = foreach m generate ...oh, forget it.

Is Pig's map type half-baked? What am I missing?

Rom*_*ain 0

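I think you need to think in terms of relations, where a map is just one field of a record. You can then apply operations to the relations, such as joining the data set with the mapping: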

Input

$ cat data.txt 
1
2
3
4
5
$ cat mapping.txt 
1   2
2   4
3   6
4   8
5   10

mapping = LOAD 'mapping.txt' AS (key:CHARARRAY, value:CHARARRAY);

data = LOAD 'data.txt' AS (value:CHARARRAY);


-- list keys
mapping_keys =
  FOREACH mapping
  GENERATE key;

DUMP mapping_keys;


-- join mapping to data
mapped_data =
  JOIN mapping BY key, data BY value;

DUMP mapped_data;

Output

> # keys
(1)
(2)
(3)
(4)
(5)

> # mapped data
(1,2,1)
(2,4,2)
(3,6,3)
(4,8,4)
(5,10,5)

If you only want to do a simple lookup, this answer may also help: pass-a-relation-to-a-pig-udf-when-using-foreach-on-another-relation
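
To make the "a map is just one field of a record" point concrete, here is a minimal sketch. It assumes a Pig version that ships the builtin TOMAP and reuses the map.csv file from the question; the aliases kv, m, and looked_up are illustrative names only. As far as I know, the # operator accepts only a constant string as the key, which is why a lookup driven by the values of another relation still calls for a join like the one above.

```pig
-- Load the two-column CSV from the question as plain fields.
kv = LOAD 'map.csv' USING PigStorage(',') AS (key:chararray, val:chararray);

-- TOMAP takes alternating key/value expressions and returns a map,
-- so each record ends up holding a one-entry map in field M.
m = FOREACH kv GENERATE TOMAP(key, val) AS M;

-- Dereference the map field (M), not the relation alias (m), with a constant key.
-- Records whose map does not contain the key yield null.
looked_up = FOREACH m GENERATE M#'3';
DUMP looked_up;
```

Note that the "Invalid alias: m" errors in the question arise from using the relation alias m where the field name M is expected.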