why*_*ame 5 java lucene nlp tokenize stop-words
I am trying to tokenize a txt file and remove the stop words from it using Lucene. I have this:
public String removeStopWords(String string) throws IOException {
    Set<String> stopWords = new HashSet<String>();
    stopWords.add("a");
    stopWords.add("an");
    stopWords.add("I");
    stopWords.add("the");
    TokenStream tokenStream = new StandardTokenizer(Version.LUCENE_43, new StringReader(string));
    tokenStream = new StopFilter(Version.LUCENE_43, tokenStream, stopWords);
    StringBuilder sb = new StringBuilder();
    CharTermAttribute token = tokenStream.getAttribute(CharTermAttribute.class);
    while (tokenStream.incrementToken()) {
        if (sb.length() > 0) {
            sb.append(" ");
        }
        sb.append(token.toString());
        System.out.println(sb);
    }
    return sb.toString();
}
}
My main looks like this:
String file = "..../datatest.txt";
TestFileReader fr = new TestFileReader();
fr.imports(file);
System.out.println(fr.content);
String text = fr.content;
Stopwords stopwords = new Stopwords();
stopwords.removeStopWords(text);
System.out.println(stopwords.removeStopWords(text));
This gives me an error, but I can't figure out why.
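I had the same problem. To remove stop words with Lucene, you can either use its default stop set, obtained with EnglishAnalyzer.getDefaultStopSet(), or create your own custom stop word list.
The code below is a corrected version of your removeStopWords(). Besides switching to the default stop set, it calls tokenStream.reset() before the first incrementToken(), which the TokenStream contract requires and which your version omits: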
public static String removeStopWords(String textFile) throws Exception {
    // Lucene's built-in English stop word set
    CharArraySet stopWords = EnglishAnalyzer.getDefaultStopSet();
    TokenStream tokenStream = new StandardTokenizer(Version.LUCENE_48, new StringReader(textFile.trim()));
    tokenStream = new StopFilter(Version.LUCENE_48, tokenStream, stopWords);
    StringBuilder sb = new StringBuilder();
    CharTermAttribute charTermAttribute = tokenStream.addAttribute(CharTermAttribute.class);
    // reset() must be called before the first incrementToken()
    tokenStream.reset();
    while (tokenStream.incrementToken()) {
        String term = charTermAttribute.toString();
        sb.append(term + " ");
    }
    // Signal end of stream and release its resources
    tokenStream.end();
    tokenStream.close();
    return sb.toString();
}
To use a custom stop word list instead, build the set like this:
//CharArraySet stopWords = EnglishAnalyzer.getDefaultStopSet(); //this is Lucene set
final List<String> stop_Words = Arrays.asList("fox", "the");
final CharArraySet stopSet = new CharArraySet(Version.LUCENE_48, stop_Words, true);
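
The third constructor argument above (true) asks CharArraySet to ignore case when matching. To show what that means for the output, here is a minimal plain-Java sketch that does not use Lucene at all: the class StopWordSketch and its removeStopWords helper are hypothetical names made up for illustration, and a simple whitespace split stands in for StandardTokenizer, so punctuation handling will differ from the real thing.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import java.util.StringJoiner;

public class StopWordSketch {

    // Hypothetical helper, not part of Lucene: drops every whitespace-separated
    // token that appears in stopWords, comparing case-insensitively.
    public static String removeStopWords(String text, Set<String> stopWords) {
        // Lower-case the stop words once so lookups can ignore case
        Set<String> lowered = new HashSet<>();
        for (String w : stopWords) {
            lowered.add(w.toLowerCase(Locale.ROOT));
        }
        // StringJoiner avoids a trailing space after the last kept token
        StringJoiner kept = new StringJoiner(" ");
        for (String token : text.trim().split("\\s+")) {
            if (!token.isEmpty() && !lowered.contains(token.toLowerCase(Locale.ROOT))) {
                kept.add(token); // kept tokens retain their original case
            }
        }
        return kept.toString();
    }

    public static void main(String[] args) {
        Set<String> stopSet = new HashSet<>(Arrays.asList("fox", "the"));
        // prints "quick brown jumps"
        System.out.println(removeStopWords("The quick brown Fox jumps", stopSet));
    }
}
```

Both "The" and "Fox" are dropped even though the list only contains the lower-case forms, which is the behaviour the ignore-case flag is meant to give you.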