I'm trying to use Lucene to tokenize a txt file and remove its stop words. I have this:
    public String removeStopWords(String string) throws IOException {
        Set<String> stopWords = new HashSet<String>();
        stopWords.add("a");
        stopWords.add("an");
        stopWords.add("I");
        stopWords.add("the");
        TokenStream tokenStream = new StandardTokenizer(Version.LUCENE_43, new StringReader(string));
        tokenStream = new StopFilter(Version.LUCENE_43, tokenStream, stopWords);
        StringBuilder sb = new StringBuilder();
        CharTermAttribute token = tokenStream.getAttribute(CharTermAttribute.class);
        while (tokenStream.incrementToken()) {
            if (sb.length() > 0) {
                sb.append(" ");
            }
            sb.append(token.toString());
            System.out.println(sb);
        }
        return sb.toString();
    }
}
My main looks like this:
String file = "..../datatest.txt";
TestFileReader fr = new TestFileReader();
fr.imports(file);
System.out.println(fr.content);
String text = fr.content;
Stopwords stopwords = new Stopwords();
stopwords.removeStopWords(text);
System.out.println(stopwords.removeStopWords(text));
This gives me an error, but I can't figure out why.
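
For reference, here is a minimal Lucene-free sketch of the result removeStopWords is meant to produce. It is only a baseline to compare against, not a fix for the error. The class name PlainStopWords and the sample sentence are made up for illustration, and the regex split is an assumption that only roughly approximates StandardTokenizer (it will differ on apostrophes, e-mail addresses and similar input). Matching is case-sensitive, like the plain HashSet above.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper, not part of Lucene: plain-Java baseline for stop-word removal.
public class PlainStopWords {
    // Same four stop words as in the question; matching is case-sensitive.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "I", "the"));

    static String removeStopWords(String text) {
        StringBuilder sb = new StringBuilder();
        // Assumption: split on runs of anything that is not a letter or digit.
        for (String token : text.split("[^\\p{L}\\p{N}]+")) {
            // split() can yield an empty first element when the text starts with a delimiter.
            if (token.isEmpty() || STOP_WORDS.contains(token)) {
                continue;
            }
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(token);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // prints "saw cat and owl in barn"
        System.out.println(removeStopWords("I saw a cat and an owl in the barn"));
    }
}
```

Running the same input through the Lucene version and comparing the two strings should show whether the problem is in the token loop or somewhere earlier, such as reading the file.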