Ada*_*dam 2 java java-8 java-stream
I'd like to use Java 8 streams to take a stream of strings (for example, read from a plain text file) and produce a stream of sentences. I assume that sentences can cross line boundaries.

For example, I want to go from:

    "This is the", "first sentence. This is the", "second sentence."

to:

    "This is the first sentence.", "This is the second sentence."

I can see that it's possible to get a stream of sentence fragments like this:

    Pattern p = Pattern.compile("\\.");
    Stream<String> lines
        = Stream.of("This is the", "first sentence. This is the", "second sentence.");

    Stream<String> result = lines.flatMap(s -> p.splitAsStream(s));

But then I'm not sure how to produce a stream that joins the fragments into sentences. I want to do this lazily, so that only as much of the original stream is read as is needed. Any ideas?
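
Breaking text into sentences is not as easy as just looking for dots. For example, you don't want to split in the middle of "Mr.Smith"…

Thankfully, there is already a JRE class that takes care of this: BreakIterator. What it doesn't have is Stream support, so using it with streams requires some support code around it: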

    import java.nio.CharBuffer;
    import java.text.BreakIterator;
    import java.util.Locale;
    import java.util.Spliterator;
    import java.util.Spliterators;
    import java.util.function.Consumer;
    import java.util.stream.Stream;
    import java.util.stream.StreamSupport;

    public class SentenceStream extends Spliterators.AbstractSpliterator<String>
    implements Consumer<CharSequence> {

        public static Stream<String> sentences(Stream<? extends CharSequence> s) {
            return StreamSupport.stream(new SentenceStream(s.spliterator()), false);
        }
        Spliterator<? extends CharSequence> source;
        CharBuffer buffer;       // text received from the source but not yet emitted
        BreakIterator iterator;

        public SentenceStream(Spliterator<? extends CharSequence> source) {
            super(Long.MAX_VALUE, ORDERED|NONNULL);
            this.source = source;
            iterator=BreakIterator.getSentenceInstance(Locale.ENGLISH);
            buffer=CharBuffer.allocate(100);
            buffer.flip();       // start in read mode with no content
        }

        @Override
        public boolean tryAdvance(Consumer<? super String> action) {
            for(;;) {
                int next=iterator.next();
                // A boundary at the very end of the buffer is not trusted yet,
                // because the sentence may continue in the next source element.
                if(next!=BreakIterator.DONE && next!=buffer.limit()) {
                    action.accept(buffer.subSequence(0, next-buffer.position()).toString());
                    buffer.position(next);
                    return true;
                }
                // No complete sentence buffered: pull one more element via accept().
                if(!source.tryAdvance(this)) {
                    // Source exhausted: emit whatever is left as the last sentence.
                    if(buffer.hasRemaining()) {
                        action.accept(buffer.toString());
                        buffer.position(0).limit(0);
                        return true;
                    }
                    return false;
                }
                iterator.setText(buffer.toString());
            }
        }

        @Override
        public void accept(CharSequence t) {
            buffer.compact();    // move unread text to the front, switch to write mode
            if(buffer.remaining()<t.length()) {
                // Not enough room: copy the pending text into a larger buffer.
                CharBuffer bigger=CharBuffer.allocate(
                    Math.max(buffer.capacity()*2, buffer.position()+t.length()));
                buffer.flip();
                bigger.put(buffer);
                buffer=bigger;
            }
            buffer.append(t).flip();
        }
    }

With that support class, you can simply say, for example:

    Stream<String> lines = Stream.of(
        "This is the ", "first sentence. This is the ", "second sentence.");
    SentenceStream.sentences(lines).forEachOrdered(System.out::println);
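
To see what BreakIterator contributes on its own, independent of the stream plumbing, here is a minimal sketch that splits a single, already complete string. The class name `BreakIteratorDemo` and the helper `split` are hypothetical names chosen for illustration; only the `java.text.BreakIterator` calls come from the JDK. A boundary is reported after the whitespace that follows a sentence, so the sketch trims each piece.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class BreakIteratorDemo {

    // Hypothetical helper: splits one complete string into sentences.
    static List<String> split(String text) {
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.ENGLISH);
        it.setText(text);
        List<String> sentences = new ArrayList<>();
        int start = it.first();
        // Walk from boundary to boundary until the iterator reports DONE.
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            // Each piece includes the trailing whitespace, hence trim().
            sentences.add(text.substring(start, end).trim());
        }
        return sentences;
    }

    public static void main(String[] args) {
        split("This is the first sentence. This is the second sentence.")
            .forEach(System.out::println);
    }
}
```

Unlike this sketch, the spliterator above cannot assume that the end of its buffer is the end of the text, which is why it holds back a boundary that coincides with the buffer limit until the source delivers more input or is exhausted.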