Java 8 stream of sentences

Ada*_*dam 2 java java-8 java-stream

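I'd like to use Java 8 streams to take a stream of strings (for example, read from a plain text file) and produce a stream of sentences. I am assuming that sentences can span line boundaries.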


For example, I want to go from:

    "This is the", "first sentence.  This is the", "second sentence."

to:

    "This is the first sentence.", "This is the second sentence."

I can see that it's possible to get a stream of sentence fragments like this:

    Pattern p = Pattern.compile("\\.");
    Stream<String> lines
       = Stream.of("This is the", "first sentence.  This is the", "second sentence.");

    Stream<String> result = lines.flatMap(s -> p.splitAsStream(s));

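But then I'm not sure how to produce a stream that joins the fragments into sentences. I want to do this lazily, so that only what is needed is read from the original stream. Any ideas?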


Hol*_*ger 5

Splitting text into sentences is not as easy as just looking for dots. For example, you don't want to split inside "Mr. Smith"…
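
To make the problem concrete, here is a minimal sketch of what dot-based splitting does to an abbreviation. The class name `NaiveDotSplit` and the sample sentence are made up for illustration; only `Pattern.split` is standard API.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

public class NaiveDotSplit {

    // Splits on every literal dot, the same way the question's pattern does.
    static List<String> split(String text) {
        return Arrays.asList(Pattern.compile("\\.").split(text));
    }

    public static void main(String[] args) {
        // The dot in the abbreviation "Mr." is treated as a sentence end.
        System.out.println(split("Mr. Smith arrived."));
    }
}
```

The single sentence comes back as two fragments, `Mr` and ` Smith arrived`, and the dots themselves are consumed by the split.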


Thankfully, there is already a JRE class that handles this, the BreakIterator. It doesn't have Stream support, so to use it with streams, some support code around it is required:

    import java.nio.CharBuffer;
    import java.text.BreakIterator;
    import java.util.Locale;
    import java.util.Spliterator;
    import java.util.Spliterators;
    import java.util.function.Consumer;
    import java.util.stream.Stream;
    import java.util.stream.StreamSupport;

    public class SentenceStream extends Spliterators.AbstractSpliterator<String>
    implements Consumer<CharSequence> {

        public static Stream<String> sentences(Stream<? extends CharSequence> s) {
            return StreamSupport.stream(new SentenceStream(s.spliterator()), false);
        }
        Spliterator<? extends CharSequence> source;
        CharBuffer buffer;
        BreakIterator iterator;

        public SentenceStream(Spliterator<? extends CharSequence> source) {
            super(Long.MAX_VALUE, ORDERED|NONNULL);
            this.source = source;
            iterator=BreakIterator.getSentenceInstance(Locale.ENGLISH);
            buffer=CharBuffer.allocate(100);
            buffer.flip(); // start in "read" mode with nothing buffered
        }

        @Override
        public boolean tryAdvance(Consumer<? super String> action) {
            for(;;) {
                int next=iterator.next();
                // A boundary at the very end of the buffer may still be extended
                // by the next chunk, so only emit boundaries before the limit.
                if(next!=BreakIterator.DONE && next!=buffer.limit()) {
                    action.accept(buffer.subSequence(0, next-buffer.position()).toString());
                    buffer.position(next);
                    return true;
                }
                // Pull one more chunk from the source; accept() appends it.
                if(!source.tryAdvance(this)) {
                    // Source exhausted: flush whatever is left as the last sentence.
                    if(buffer.hasRemaining()) {
                        action.accept(buffer.toString());
                        buffer.position(0).limit(0);
                        return true;
                    }
                    return false;
                }
                iterator.setText(buffer.toString());
            }
        }

        @Override
        public void accept(CharSequence t) {
            buffer.compact(); // drop consumed text and switch to "write" mode
            if(buffer.remaining()<t.length()) {
                CharBuffer bigger=CharBuffer.allocate(
                    Math.max(buffer.capacity()*2, buffer.position()+t.length()));
                buffer.flip();
                bigger.put(buffer);
                buffer=bigger;
            }
            buffer.append(t).flip();
        }
    }

With that support class, you can simply say, for example:

    Stream<String> lines = Stream.of(
        "This is the ", "first sentence. This is the ", "second sentence.");
    SentenceStream.sentences(lines).forEachOrdered(System.out::println);
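
For readers who want to see what BreakIterator does on its own, without any stream plumbing, the following is a minimal sketch that splits a single in-memory string. The class name `BreakIteratorDemo` and its `split` helper are hypothetical and not part of the answer's code. Trimming the results is an added choice here, since each boundary includes the whitespace that follows the sentence.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class BreakIteratorDemo {

    // Splits one string into sentences using the JRE's sentence BreakIterator.
    static List<String> split(String text) {
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.ENGLISH);
        it.setText(text);
        List<String> sentences = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            // A sentence runs from one boundary to the next; trailing
            // whitespace belongs to the preceding sentence, so trim it.
            sentences.add(text.substring(start, end).trim());
        }
        return sentences;
    }

    public static void main(String[] args) {
        split("This is the first sentence. This is the second sentence.")
            .forEach(System.out::println);
    }
}
```

Unlike this helper, which needs the whole text up front, the spliterator-based class only buffers as many input chunks as it takes to find the next boundary, which is what keeps the sentence stream lazy.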