Mar*_*ano 8 java-8 java-stream
处理数据时的常见问题是基于某些属性构建项目的排名.
该排名操作包括映射所述物品到一组序数(称为排名)的.在存在关系(即,具有相同排名的两个项目)的情况下,可以使用若干策略,在该上下文中,我们将假设使用标准竞争排名(也称为"1224"排名).
我想在这里调查的问题是使用流API进行此操作的" 最佳 "方式是什么.最好的意思是,最有效,最有效,最简单等等.
让我们从一个非常简单的例子开始:一个单词流(即String
s),应该通过减少长度来创建排名.如果我们采取本文件的第一段,即
A common problem when processing data is to build
a ranking of itemsbased on some property
Run Code Online (Sandbox Code Playgroud)
我们应该得到以下排名(通过反向长度):
1 - processing (10)
2 - property (8)
3 - problem (7)
3 - ranking (7)
5 - common (6)
6 - build (5)
6 - items (5)
6 - based (5)
9 - when (4)
9 - data (4)
9 - some (4)
12 - is (2)
12 - to (2)
12 - of (2)
12 - on (2)
16 - A (1)
16 - a (1)
Run Code Online (Sandbox Code Playgroud)
当两个项目共享相同的排名属性值时,会为它们分配相同的排名.
这里的主要问题是:
您使用什么流API构造来计算排名以及如何组织代码?
第二个相关问题是
你如何表示排名计算的结果?
这两个问题一个接一个地解决
如上所示,典型排名是以其相对排名值开头的项目列表.
对于排名的返回值,至少有两种可能的实现方式.
List<Rank>
排名的可能内部表示可以是包含排名值和项目的元素列表.例如,我们可以定义一个类:
public class Rank<T> {
protected int ranking;
protected T item;
protected Rank(int r, T i) { ranking=t; item=i; }
public int getRanking() { return ranking; }
public T getItem() { return item; }
static <T> Rank<T> factory(int ranking, T item) {
return new Rank<T>(ranking,item);
}
}
Run Code Online (Sandbox Code Playgroud)
或者,使用更加缓慢的代码:
public interface Rank<T> {
int getRanking();
T getItem();
static <T> Rank<T> factory(int ranking, T item) {
return new Rank<T>(){
public int getRanking() { return ranking; }
public T getItem() { return item; }
};}
}
Run Code Online (Sandbox Code Playgroud)
得到的排名将作为a返回List<Rank>
.必须按降序排序并按该顺序处理列表.
这种解决方案的潜在缺点是需要新的ad-hoc类或接口(即Rank
).
SortedMap<Integer,List<T>>
An alternative solution that uses only predefined standard classes and intefaces is based on a map.
The keys of the map correspond to the ranking values, while the values of the map consist in lists that contain the items sharing that ranking.
The return type of the ranking would be a SortedMap<T,List<T>>
.
The sorted map will be implicitly sorted and can be traversed according to the natural order of the key.
This latter solution seems preferable because it does not require any new class and can be better understood.
The ranking computation procedure can be implemented in different ways. Here two approaches are examined.
In all cases we need to have a function that serves the purpose of extracting the ranking property:
Function<String,Integer> propertyExtractor = String::length;
Run Code Online (Sandbox Code Playgroud)
Typically, the property type should be comparable, though to be more general (e.g. to sort strings in decreasing order) the property comparator can be defined:
Comparator<Integer> propertyComparator = reverseOrder();
Run Code Online (Sandbox Code Playgroud)
The two approaches are illustrated, referring to the case study described above, using a words
stream as a starting point:
String paragraph = "A common problem when processing data is to build a ranking of items based on some property";
Stream<String> words = Stream.of(paragraph.split(" "));
Run Code Online (Sandbox Code Playgroud)
An initial naive version of the procedure can be defined using a supporting map data structure upon which the stream operations work by means of side effects:
SortedMap<Integer,List<String>> ranking = new TreeMap<>();
Run Code Online (Sandbox Code Playgroud)
The procedure is as follows:
Sort the elements of the stream
words.sorted(comparing(propetyExtractor,propertyComparator))
Run Code Online (Sandbox Code Playgroud)Assign the ranking, starting with 1
and incrementing it
any time the property changes. The ranking has to be incremented
by the number of items that shared the current ranking.
.forEach( item -> {
Integer property = propertyExtractor.apply(item);
if(ranking.isEmpty()){
ranking.put(1,new LinkedList<String>());
}else{
Integer rank = ranking.lastKey();
List<String> items = ranking.get(rank);
if(! property.equals(propertyExtractor.apply(items.get(0)))) {
ranking.put(rank+items.size(), new LinkedList<String>());
}
}
ranking.get(ranking.lastKey()).add(item);
});
Run Code Online (Sandbox Code Playgroud)The above code works as follows:
The initial sequence of items is unsorted:
{"DD","BBB","F","CCC","AAAA","EE"}
Step 1 sorts the item using the propertyExtractor
and the relative propertyComparator
, i.e. by string length:
{"AAAA","BBB","CCC","DD","EE","F"}
then for each element (in order) a new entry in the map is added any time the lenght of the new element differs from the previous element lenght
When "AAAA"
is processed (first item) a new entry is added to
the map with key (i.e. ranking) = 1:
ranking
: {1=["AAAA"]}
As the second item "BBB"
is processed, since its lenght (3)
is different from the last element length (4, the lenght of "AAAA"
)
a new entry is added with un updated ranking value:
ranking
: {1=["AAAA"],2=["BBB"]}
As the third item "CCC"
is processed, since its lenght (3)
is equal to the last element length, the item is added
to the entry list
ranking
: {1=["AAAA"],2=["BBB","CCC"]}
etc.
A more functional-style version can be defined by using a collector that encapsulates the data structure where the results are accumulated. In practice we can write a single stream expression that returns the ranking map:
SortedMap<Integer,List<String>> ranking =
words.sorted(comparing(propertyExtractor,propertyComparator))
.collect(TreeMap::new, // supplier
(rank, item) -> { // accumulator
Integer property = propertyExtractor.apply(item);
if(rank.isEmpty()){
rank.put(1,new LinkedList<String>());
}else{
Integer r = rank.lastKey();
List<String> items = rank.get(r);
Integer prevProp = propertyExtractor.apply(items.get(0))
if(! property.equals(prevProp)) {
rank.put(r+items.size(), new LinkedList<String>());
}
}
rank.get(rank.lastKey()).add(item);
},
(rank1,rank2) -> { // combiner
\\...
}
);
Run Code Online (Sandbox Code Playgroud)
The combiner method that is left undefiened in the above code deserves some additional reflections.
The functional interface is involved when the collection is performed in parallel; it is used to combine two partial results of the accumulation into a single one. In this case, it has to implement the interface BiConsumer<R,R>
and it is supposed to combine the two accumulated rankings - rank1
and rank2
- into the first one.
A possible implementation of the supplier is:
BiConsumer<SortedMap<Integer,List<String>>,
SortedMap<Integer,List<String>>> combiner =
(rank1,rank2) -> {
int lastRanking = rank1.lastKey();
int offset = lastRanking + rank1.get(lastRanking).size()-1;
if( propertyExtractor.apply(rank1.get(lastRanking).get(0))
== propertyExtractor.apply(rank2.get(rank2.firstKey()).get(0)) ){
rank1.get(lastRanking).addAll(rank2.get(rank2.firstKey()));
rank2.remove(rank2.firstKey());
}
rank2.forEach((r,items) -> {rank1.put(offset+r, items);} );
}
Run Code Online (Sandbox Code Playgroud)
The collector has no declared properties, so the elements will be processed in order and then composed in order. E.g. the sorted list of items is divided into two parts, then the first part is processed through with one copy of the collector producing as result the map rank1
; in parallel, the second part is processed giving as result rank2
.
Let us suppose the streams are processed in parallel, in two concurent execution of the collect operations:
the initial sorted stream (result of the sorted()
operation is
divided into two stream, maintaining the orded
{"AAAA","BBB","CCC"}
{"DD","EE","F"}
two distinct collectors operate in parallel, on each sub-stream the result are two maps contain the partial ranking of each sub-stream:
rank1
= {1=["AAAA"],2=["BBB","CCC"]}
rank2
= {1=["DD","EE"],3=["F"]}
The combiner operation should merge rank2
into rank1
, in practice
each entry of the rank2
operation should be added to rank1
with its
key updated. They keys are updated by adding an offset that is equal to
the key of the last entry plus the lenght of the last entry value list
minus one:
int lastRanking = rank1.lastKey();
int offset = lastRanking + rank1.get(lastRanking).size()-1;
Run Code Online (Sandbox Code Playgroud)
In practice the entry 1=["DD","EE"]
of rank2
should be turned into
4=["DD","EE"]
and added to rank1
.
In addition, a special case should be considered, i.e. when the items
in the last entry of rank1
and those in the first entry of rank2
share the same ranking property value. E.g. for string length:
rank1
= {1=["AAAA"],2=["BBB","CCC"]}
rank2
= {1=["DDD"],2=["EE"],3=["F"]}
When this case occurs, the items in the rank2
first entry list
must be added to this in the last rank1
entry list, and then the
first entry removed.
That is the above maps should be transformed into:
rank1
= {1=["AAAA"],2=["BBB","CCC"
,"DDD"
]}
rank2
= {
~~1=["DDD"],
~~2=["EE"],3=["F"]}
Then the entries of rank2
can be updated and added to rank1
as
described above.
As for the sorted list version, an initial naive version of the procedure can be defined using a supporting map data structure upon which the stream operations work by means of side effects:
SortedMap<Integer,List<String>> ranking = new TreeMap<>();
Run Code Online (Sandbox Code Playgroud)
The naive procedure is as follows:
group the items by their property in a sorted map
words.collect(groupingBy(propertyExtractor,
()->new TreeMap<>(propertyComparator),toList()))
Run Code Online (Sandbox Code Playgroud)for each entry in the above map compute the ranking
.forEach(
(property,items) ->
ranking.put(ranking.isEmpty()?1:ranking.lastKey()+
ranking.get(ranking.lastKey()).size(),
items )
);
Run Code Online (Sandbox Code Playgroud)The above code works as follows:
The initial sequence of items is unsorted:
{"DD","BBB","F","CCC","AAAA","EE"}
Step 1 groups the item by their propertyExtractor
and store them into
a sorted set whose keys are sorted through the propertyComparator
, i.e.
by string length:
{4=["AAAA"],3=["BBB","CCC"],2=["DD","EE"],1=["F"]}
Step 2, creates for each entry in the intermediate map a new entry in the result map have the ranking value as key and the same value (i.e. the list of items) as the intermediate entry.
The ranking is computed as follows:
ranking.lastKey()
) plus number of elements
sharing the previous ranking (ranking.get(ranking.lastKey()).size()
).The result is the final ranking map:
{1=["AAAA"],2=["BBB","CCC"],4=["DD","EE"],6=["F"]}
The above procedure can be rewritten using a collector to avoid side effects of the operations.
Since the first step consists in a collection, we can use the predefined collector factory methdo collectingAndThen
to concatenate the first collector with a function that is applied to the collector result;
such a function will perform the step 2 described above.
SortedMap<Integer,List<String>> ranking =
words.collect(
collectingAndThen(
groupingBy(propertyExtractor,
()->new TreeMap<>(propertyComparator),
toList()),
map -> map.entrySet().stream().collect(
TreeMap::new,
(rank,entry) ->
rank.put(rank.isEmpty()?1:rank.lastKey()+
rank.get(rank.lastKey()).size(),
entry.getValue() ),
combiner
)
)
);
Run Code Online (Sandbox Code Playgroud)
Since the structure of the result, i.e. the accumulator object, is the same as the sorted stream solution, here the same combiner can be used.
The above discussion and solution applied to the special case of a stream of Strings. But the approach can be generalized using generics.
The solution based on the sorted stream can be enclosed in a function:
static <T,V> SortedMap<Integer,List<T>>
rank(Stream<T> stream,
Function<T,V> propertyExtractor,
Comparator<V> propertyComparator){
return
stream.sorted(comparing(propertyExtractor,propertyComparator))
.collect(TreeMap::new,
(rank, item) -> {
V property = propertyExtractor.apply(item);
if(rank.isEmpty()){
rank.put(new Integer(1),new LinkedList<T>());
}else{
Integer r = rank.lastKey();
List<T> items = rank.get(r);
if(! property.equals(propertyExtractor.apply(items.get(0)))) {
rank.put(r+items.size(), new LinkedList<T>());
}
}
rank.get(rank.lastKey()).add(item);
},
(rank1,rank2) -> {
int lastRanking = rank1.lastKey();
int offset = lastRanking + rank1.get(lastRanking).size()-1;
if( propertyExtractor.apply(rank1.get(lastRanking).get(0))
== propertyExtractor.apply(rank2.get(rank2.firstKey()).get(0))){
rank1.get(lastRanking).addAll(rank2.get(rank2.firstKey()));
rank2.remove(rank2.firstKey());
}
rank2.forEach((r,items) -> {rank1.put(offset+r, items);} );
}
);
}
Run Code Online (Sandbox Code Playgroud)
The above method can be applied as:
SortedMap<Integer,List<String>> ranking =
rank(words,String::length,reverseOrder());
Run Code Online (Sandbox Code Playgroud)
The approach based on the grouping by property value can be encapsulated in a collector:
static <T,V> Collector<T,?,SortedMap<Integer,List<T>>>
rankingCollector(Function<T,V> propertyExtractor,
Comparator<V> propertyComparator){
return
collectingAndThen(
groupingBy(propertyExtractor,
()->new TreeMap<>(propertyComparator),
toList()),
map -> map.entrySet().stream().collect(
TreeMap::new,
(rank,entry) ->
rank.put(rank.isEmpty()?1:rank.lastKey()+
rank.get(rank.lastKey()).size(),
entry.getValue() ),
(rank1,rank2) -> {
int lastRanking = rank1.lastKey();
int offset = lastRanking + rank1.get(lastRanking).size()-1;
if( propertyExtractor.apply(rank1.get(lastRanking).get(0))
==
propertyExtractor.apply(rank2.get(rank2.firstKey()).get(0))){
rank1.get(lastRanking).addAll(rank2.get(rank2.firstKey()));
rank2.remove(rank2.firstKey());
}
rank2.forEach((r,items) -> {rank1.put(offset+r, items);} );
}
)
);
}
Run Code Online (Sandbox Code Playgroud)
This collector factory method can be used e.g. as:
SortedMap<Integer,List<String>> ranking =
words.collect(rankingCollector(String::length,reverseOrder()));
Run Code Online (Sandbox Code Playgroud)
Once the ranking has been computed and stored in the map, it can be printed, typically for debug purposes.
Here are a few possible options for printing the ranking on the console.
Consumer
Using a consumer object that takes the ranking, format the entries and print them. The following code reports a factory method the returns such a consumer:
static <T,V> Consumer<Map<Integer,List<T>>>
rankPrinter(Function<T,V> propertyExtractor){
return ranking ->
ranking.entrySet().stream()
.map( e -> e.getValue().stream().map(
v -> e.getKey() + " - " + v
+ " (" + propertyExtractor.apply(v) + ")" ) )
.flatMap(Function.identity())
.forEach(System.out::println);
}
Run Code Online (Sandbox Code Playgroud)
Function
Using a function that transform the ranking map into a string consisting in the concatenation of the items.
static <T,V> Function<Map<Integer,List<T>>,String>
rankCollator(Function<T,V> propertyExtractor){
return ranking ->
ranking.entrySet().stream()
.map( e -> (Stream<String>)e.getValue().stream().
map( v -> (String)(e.getKey() + " : " + v + " (" + propertyExtractor.apply(v) + ")") ))
.flatMap(Function.identity())
.collect(joining("\n"));
}
Run Code Online (Sandbox Code Playgroud)
The above method can be used as follows:
System.out.println(rankCollator(propertyExtractor).apply(ranking));
Run Code Online (Sandbox Code Playgroud)
Map
classAs an alternative, is is possible to replace the class TreeMap
with a new class that extends it and overrides the toString()
method.
This cane be done by writing the following collector accumulator
supplier, instead of TreeMap::new
:
()->new TreeMap<Integer,List<T>>(){
public String toString(){
return entrySet().stream()
.map( e -> (Stream<String>)e.getValue().stream().map(
v -> e.getKey().toString() + " - " + v.toString()
+ " (" + propertyExtractor.apply(v) + ")" ) )
.flatMap(Function.identity())
.collect(joining("\n"));
}
}
Run Code Online (Sandbox Code Playgroud)
The complete code for the generics solution is available in class StreamRanking available on my github repo.
不像Design First我自己解决的那样,我是通过TDD方式做到的.
比方说,排名"foo"
的意志表示["1 - foo (3)"]
,我解决它通过使用Stream.map(功能)和假恒定的排名1
.
Arrays.stream(sentence.split(" "))
.map(it -> String.format("%d - %s (%d)", 1, it, it.length()))
.collect(toList());
Run Code Online (Sandbox Code Playgroud)
让我们说排名"foo bar"
将表示为["1 - bar (3)","1 - foo (3)"]
.我通过Stream.sorted(比较器)解决了它 ;
Arrays.stream(sentence.split(" "))
.sorted(String::compareTo)
.map(it -> String.format("%d - %s (%d)", 1, it, it.length()))
.collect(toList());
Run Code Online (Sandbox Code Playgroud)
让我们说排名"fuzz bar"
将表示为["1 - fuzz (4)","2 - bar (3)"]
,我通过Comparator.comparing(Function)解决了它.
Arrays.stream(sentence.split(" "))
.sorted(Comparator.comparing(String::length).reversed()
.thenComparing(String::compareTo))
.map(it -> String.format("%d - %s (%d)", 1, it, it.length()))
.collect(toList());
Run Code Online (Sandbox Code Playgroud)
结果["1 - fuzz(4)","1 ~~ 2 ~~ - bar(3)"]是相同的,除了伪造的等级常数.1
所以我在改变代码之前做了一个小的重构:
Arrays.stream(sentence.split(" "))
.sorted(Comparator.comparing(String::length).reversed()
.thenComparing(String::compareTo))
.map(it -> new AbstractMap.SimpleEntry<>(1,it))
.map(it -> String.format("%d - %s (%d)", it.getKey(), it.getValue(), it.getValue().length()))
.collect(toList());
Run Code Online (Sandbox Code Playgroud)
然后我可以用Stream.collect(...). stream()替换第一个Stream.map(Function),并用.替换伪造的等级常量.1
ranking.size() + 1
Arrays.stream(sentence.split(" "))
.sorted(Comparator.comparing(String::length).reversed()
.thenComparing(String::compareTo))
.collect(ArrayList<Map.Entry<Integer, String>>::new
, (ranking, it) -> ranking.add(new AbstractMap.SimpleEntry<>(ranking.size() + 1, it))
, List::addAll).stream()
.map(it -> String.format("%d - %s (%d)", it.getKey(), it.getValue(), it.getValue().length()))
.collect(toList());
Run Code Online (Sandbox Code Playgroud)
但是第2步因为rank = ranking.size() + 1
; 然后我意识到我需要将它的长度与最后一个排名项进行比较.
BiConsumer<List<Map.Entry<Integer, String>>, String> accumulator = (ranking, it) -> {
int rank = ranking.size() + 1;
if (!ranking.isEmpty()) {
Map.Entry<Integer, String> last = ranking.get(ranking.size() - 1);
if (last.getValue().length() == it.length()) {
rank = last.getKey();
}
}
ranking.add(new AbstractMap.SimpleEntry<>(rank, it));
};
List<String> ranking = Arrays.stream(sentence.split(" "))
.sorted(Comparator.comparing(String::length).reversed()
.thenComparing(String::compareTo))
.collect(ArrayList::new, accumulator, List::addAll).stream()
.map(it -> String.format("%d - %s (%d)", it.getKey(), it.getValue(), it.getValue().length()))
.collect(toList());
Run Code Online (Sandbox Code Playgroud)
实际上,我发现我可以进一步用Stack替换ArrayList:
BiConsumer<Stack<Map.Entry<Integer, String>>, String> accumulator = (ranking, it) -> {
int rank = ranking.size() + 1;
if (!ranking.isEmpty()) {
Map.Entry<Integer, String> last = ranking.peek();
if (last.getValue().length() == it.length()) {
rank = last.getKey();
}
}
ranking.add(new AbstractMap.SimpleEntry<>(rank, it));
};
List<String> ranking = Arrays.stream(sentence.split(" "))
.sorted(Comparator.comparing(String::length).reversed()
.thenComparing(String::compareTo))
.collect(Stack::new, accumulator, List::addAll).stream()
.map(it -> String.format("%d - %s (%d)", it.getKey(), it.getValue(), it.getValue().length()))
.collect(toList());
Run Code Online (Sandbox Code Playgroud)
完成代码后,我发现代码中存在一个错误.让我们说排名"fuzz buzz foo"
将表示为["1 - buzz (4)","1 - fuzz (4)","2 - foo (3)"]
,我通过计算排名来解决它last.rank + 1
而不是ranking.size() + 1
.
BiConsumer<Stack<Entry<Integer, String>>, String> accumulator = (ranking, it) -> {
int rank;
if (!ranking.isEmpty()) {
Entry<Integer, String> last = ranking.peek();
if (last.getValue().length() == it.length()) {
rank = last.getKey();
} else {
rank = last.getKey() + 1;
}
} else {
rank = 1;
}
ranking.add(new AbstractMap.SimpleEntry<>(rank, it));
};
List<String> ranking = Arrays.stream(sentence.split(" "))
.sorted(comparing(String::length).reversed()
.thenComparing(String::compareTo))
.collect(Stack::new, accumulator, List::addAll).stream()
.map(it -> format("%d - %s (%d)", it.getKey(), it.getValue(), it.getValue().length()))
.collect(toList());
Run Code Online (Sandbox Code Playgroud)
我accumulator
用一个有意义的名称重命名变量,例如:rankingEvaluation
,并null
用Null Object和.etc 替换.
Map.Entry<Integer, String> NONE = new AbstractMap.SimpleEntry<>(0, "");
BiConsumer<Stack<Entry<Integer, String>>, String> rankingEvaluation = (ranking, it) -> {
Entry<Integer, String> last = ranking.isEmpty() ? NONE : ranking.peek();
int rank = last.getValue().length() == it.length()
? last.getKey()
: last.getKey() + 1;
ranking.add(new AbstractMap.SimpleEntry<>(rank, it));
};
List<String> ranking = Arrays.stream(sentence.split(" "))
.sorted(comparing(String::length).reversed()
.thenComparing(String::compareTo))
.collect(Stack::new, rankingEvaluation, List::addAll).stream()
.map(it -> format("%d - %s (%d)", it.getKey(), it.getValue(), it.getValue().length()))
.collect(toList());
Run Code Online (Sandbox Code Playgroud)