How to handle Pig memory spills

mar*_*ark 2 hadoop apache-pig

My code looks like this:

pymt = LOAD 'pymt' USING PigStorage('|') AS ($pymt_schema);

pymt_grp = GROUP pymt BY key;

results = FOREACH pymt_grp {

      /*
       *   some kind of logic, filter, count, distinct, sum, etc.
       */
}

But now I am seeing a lot of log lines like this:

org.apache.pig.impl.util.SpillableMemoryManager: Spilled an estimate of 207012796 bytes from 1 objects. init = 5439488(5312K) used = 424200488(414258K) committed = 559284224(546176K) max = 559284224(546176K)

I have actually found the cause: most of it comes from a single "hot" key, e.g. key = 0 used as an IP address. I don't want to filter this key out, though. Is there any solution? I have already implemented both the Algebraic and Accumulator interfaces in my UDF.

ale*_*pab 6

I had a similar problem with heavily skewed data, or with a DISTINCT nested inside a FOREACH (because Pig performs that DISTINCT in memory). The solution was to move the DISTINCT out of the FOREACH; as an example, see my answer to "How do I optimize a group by statement in PIG latin?"
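To illustrate the idea, a nested DISTINCT can be replaced by projecting and de-duplicating before the grouping, so the de-duplication runs as its own MapReduce job instead of in reducer memory. This is a minimal sketch, assuming (hypothetically) that the records in `pymt` carry an `ip` field to be counted per `key`:

```pig
-- instead of a nested DISTINCT, which Pig evaluates in memory per group:
--   results = FOREACH pymt_grp { u = DISTINCT pymt.ip; GENERATE group, COUNT(u); }
key_ip   = FOREACH pymt GENERATE key, ip;   -- project only the needed columns
uniq     = DISTINCT key_ip;                 -- de-duplicate outside the FOREACH
uniq_grp = GROUP uniq BY key;
ip_cnt   = FOREACH uniq_grp GENERATE group AS key, COUNT(uniq) AS uniq_ips;
```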

If you cannot do the DISTINCT before the SUM and COUNT, then I would suggest using two GROUP BYs. The first groups on the key column plus another column or a random number mod 100, which acts as a salt (spreading the data for a single key across multiple reducers). The second GROUP BY, on the key column alone, computes the final SUM of the group-1 COUNTs or sums.

For example:

inpt = load '/data.csv' using PigStorage(',') as (Key, Value);
-- add a random salt so a single hot key is spread across reducers
view = foreach inpt generate Key, Value, ((int)(RANDOM() * 100)) as Salt;

-- first pass: partial counts per (Key, Salt)
group_1 = group view by (Key, Salt);
group_1_count = foreach group_1 generate group.Key as Key, COUNT(view) as count;

-- second pass: sum the partial counts per Key
group_2 = group group_1_count by Key;
final_count = foreach group_2 generate flatten(group) as Key, SUM(group_1_count.count) as count;