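I'm writing a MapReduce job that may end up with a huge number of values in the reducer. I'm concerned about all of these values being loaded into memory at once.
Does the underlying implementation of the Iterable<VALUEIN> values load values into memory as they are needed? Hadoop: The Definitive Guide seems to suggest this is the case, but doesn't give a "definitive" answer.
The reducer output will be far larger than the values input, but I believe the output is written to disk as needed.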
Gir*_*Rao 13
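You're reading the book correctly. The reducer does not store all values in memory. Instead, when looping through the Iterable value list, each Object instance is reused, so it only keeps one instance around at a given time.
For example, in the following code, the objs ArrayList will have the expected size after the loop, but every element will be the same because the Text val instance is reused on every iteration.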
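// Nested inside a job class. Requires java.util.ArrayList,
// org.apache.hadoop.io.Text and org.apache.hadoop.mapreduce.Reducer.
public static class ReducerExample extends Reducer<Text, Text, Text, Text> {
    @Override
    public void reduce(Text key, Iterable<Text> values, Context context) {
        ArrayList<Text> objs = new ArrayList<Text>();
        for (Text val : values) {
            // Pitfall: val is the same reused instance on every iteration,
            // so this stores many references to one object.
            objs.add(val);
        }
    }
}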
(If for some reason you do want to take further action on each val, you should make a deep copy and store that instead.)
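To make the difference between keeping the reference and keeping a copy concrete, here is a minimal sketch in plain Java with no Hadoop dependency. `MutableText`, `ReusingIterable`, and `ReuseDemo` are hypothetical stand-ins invented for illustration: the first mimics a mutable value type such as Text, the second mimics an iterator that hands back one shared instance and only changes its contents. It is a simulation of the behavior described above, not Hadoop's actual implementation.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

// Hypothetical stand-in for a mutable value type such as Hadoop's Text.
final class MutableText {
    private String content = "";

    MutableText() {}

    // Copy constructor: takes a snapshot of the other object's contents.
    MutableText(MutableText other) { this.content = other.content; }

    void set(String s) { content = s; }

    @Override public String toString() { return content; }
}

// Hypothetical stand-in for the reducer's value iterator: it returns the
// same object on every call to next() and only overwrites its contents.
final class ReusingIterable implements Iterable<MutableText> {
    private final String[] data;

    ReusingIterable(String... data) { this.data = data; }

    @Override public Iterator<MutableText> iterator() {
        final MutableText shared = new MutableText();
        return new Iterator<MutableText>() {
            private int index = 0;

            @Override public boolean hasNext() { return index < data.length; }

            @Override public MutableText next() {
                if (!hasNext()) throw new NoSuchElementException();
                shared.set(data[index++]);
                return shared;
            }
        };
    }
}

class ReuseDemo {
    // Stores the reference handed out by the iterator (the pitfall).
    static List<String> keepReferences(Iterable<MutableText> values) {
        List<MutableText> kept = new ArrayList<>();
        for (MutableText val : values) kept.add(val);
        return render(kept);
    }

    // Stores a copy of each value (the fix).
    static List<String> keepCopies(Iterable<MutableText> values) {
        List<MutableText> kept = new ArrayList<>();
        for (MutableText val : values) kept.add(new MutableText(val));
        return render(kept);
    }

    private static List<String> render(List<MutableText> items) {
        List<String> out = new ArrayList<>();
        for (MutableText t : items) out.add(t.toString());
        return out;
    }

    public static void main(String[] args) {
        ReusingIterable values = new ReusingIterable("a", "b", "c");
        System.out.println(keepReferences(values)); // [c, c, c]
        System.out.println(keepCopies(values));     // [a, b, c]
    }
}
```

In an actual reducer the equivalent of `new MutableText(val)` would be `new Text(val)`. Keep in mind that copying every value brings back the memory cost the reuse was avoiding, so it only makes sense for values you really need to retain.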
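Of course, even a single value could be larger than memory. In that case, the developer is advised to take steps in the preceding Mapper to pare the data down so that the value is not so large.
UPDATE: See pages 199-200 of Hadoop: The Definitive Guide, 2nd Edition.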
This code snippet makes it clear that the same key and value objects are used on each 
invocation of the map() method -- only their contents are changed (by the reader's 
next() method). This can be a surprise to users, who might expect keys and values to be 
immutable. This causes problems when a reference to a key or value object is retained 
outside the map() method, as its value can change without warning. If you need to do 
this, make a copy of the object you want to hold on to. For example, for a Text object, 
you can use its copy constructor: new Text(value).
The situation is similar with reducers. In this case, the value objects in the reducer's 
iterator are reused, so you need to copy any that you need to retain between calls to 
the iterator.