How do I implement sorting in Hadoop?

use*_*364 8 sorting hadoop mapreduce

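My problem is sorting the values in a file. The keys and values are integers, and each key must stay attached to its value after sorting.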

key   value
1     24
3     4
4     12
5     23

Output:

1     24
5     23
4     12
3     4

I am working with a very large amount of data and have to run the code on a cluster of Hadoop machines. How can I do this with MapReduce?

SSa*_*ker 15

You can do it like this (I am assuming you are using Java here).

Emit from the map like this:

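// Emit the value as the key and the original key as the value.
context.write(new LongWritable(24), new LongWritable(1));
context.write(new LongWritable(4), new LongWritable(3));
context.write(new LongWritable(12), new LongWritable(4));
context.write(new LongWritable(23), new LongWritable(5));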

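So every value that needs to be sorted should be the key in your MapReduce job. By default, Hadoop sorts by key in ascending order.

To sort in descending order instead, either do this: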

job.setSortComparatorClass(LongWritable.DecreasingComparator.class);

or this:

Set a custom descending sort comparator in your job, which looks like this.

public static class DescendingKeyComparator extends WritableComparator {
    protected DescendingKeyComparator() {
        // Register the actual key type so keys deserialize as LongWritable.
        super(LongWritable.class, true);
    }

    @SuppressWarnings("rawtypes")
    @Override
    public int compare(WritableComparable w1, WritableComparable w2) {
        LongWritable key1 = (LongWritable) w1;
        LongWritable key2 = (LongWritable) w2;
        // Negate the natural (ascending) order to get descending order.
        return -1 * key1.compareTo(key2);
    }
}

The shuffle and sort phase in Hadoop will then sort the keys 24, 4, 12, 23 in descending order, giving 24, 23, 12, 4.
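To see the whole idea without a cluster, here is a minimal plain-Java sketch that imitates the job on the sample data from the question. It is not Hadoop code: the class `SwapAndSortSketch` and its method `sortByValueDescending` are hypothetical names, and an in-memory sort stands in for the shuffle and sort phase.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class SwapAndSortSketch {

    /**
     * Takes rows of {key, value}, swaps each to {value, key} (the "map" step),
     * then sorts by the new key, largest first (standing in for shuffle/sort).
     */
    public static List<long[]> sortByValueDescending(List<long[]> rows) {
        List<long[]> swapped = new ArrayList<>();
        for (long[] row : rows) {
            swapped.add(new long[] {row[1], row[0]});
        }
        swapped.sort(Comparator.comparingLong((long[] pair) -> pair[0]).reversed());
        return swapped;
    }

    public static void main(String[] args) {
        List<long[]> rows = Arrays.asList(
                new long[] {1, 24},
                new long[] {3, 4},
                new long[] {4, 12},
                new long[] {5, 23});
        for (long[] pair : sortByValueDescending(rows)) {
            // Swap back so the original key is printed first again.
            System.out.println(pair[1] + "\t" + pair[0]);
        }
    }
}
```

Running `main` prints the rows in the order asked for in the question (1 24, 5 23, 4 12, 3 4). In a real job, the swap back to the original key/value order would typically happen in the reducer.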

After the comments:

If you need a descending comparator for IntWritable keys, you can create one and use it like this:

job.setSortComparatorClass(DescendingIntWritableComparable.DecreasingComparator.class);

If you are using JobConf, use this setting instead:

jobConfObject.setOutputKeyComparatorClass(DescendingIntWritableComparable.DecreasingComparator.class);

Put the following class below your main() function:

// ToolRunner.run declares "throws Exception", so main must propagate it.
public static void main(String[] args) throws Exception {
    int exitCode = ToolRunner.run(new YourDriver(), args);
    System.exit(exitCode);
}

// Define this class outside main(), as a member of the driver class.
public static class DescendingIntWritableComparable extends IntWritable {
    /** A decreasing Comparator optimized for IntWritable. */
    public static class DecreasingComparator extends Comparator {
        @SuppressWarnings("rawtypes")
        @Override
        public int compare(WritableComparable a, WritableComparable b) {
            return -super.compare(a, b);
        }

        // Raw comparison on the serialized bytes, also negated.
        @Override
        public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
            return -super.compare(b1, s1, l1, b2, s2, l2);
        }
    }
}
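The byte-array overload matters because Hadoop compares keys in their serialized form during the sort. As a rough illustration of what the negated raw comparison does, the plain-Java sketch below assumes the key is serialized as a 4-byte big-endian int (which is how `IntWritable` writes itself through `DataOutput.writeInt`). `RawDescendingIntSketch`, `serialize`, and `compareDescending` are hypothetical names used only for this illustration, not Hadoop APIs.

```java
import java.nio.ByteBuffer;

public class RawDescendingIntSketch {

    /** Serializes an int as 4 big-endian bytes (assumed to match IntWritable's layout). */
    public static byte[] serialize(int value) {
        return ByteBuffer.allocate(4).putInt(value).array();
    }

    /**
     * Compares two serialized ints starting at offsets s1 and s2,
     * returning a negative number when the first one is LARGER (descending order).
     */
    public static int compareDescending(byte[] b1, int s1, byte[] b2, int s2) {
        int left = ByteBuffer.wrap(b1, s1, 4).getInt();
        int right = ByteBuffer.wrap(b2, s2, 4).getInt();
        // Negate the ascending comparison, as the comparator above does.
        return -Integer.compare(left, right);
    }

    public static void main(String[] args) {
        byte[] a = serialize(24);
        byte[] b = serialize(4);
        // Negative result: 24 sorts before 4 in descending order.
        System.out.println(compareDescending(a, 0, b, 0));
    }
}
```

The point of the design is that no key objects have to be created for each comparison; the real `IntWritable.Comparator` reads the ints straight out of the byte buffers in a similar way.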