Removing stop words from a String in Java

Jav*_*ner 7 java string stop-words

I have a string containing many words, and a text file listing the stop words that need to be removed from that string. Suppose I have the string:

s="I love this phone, its super fast and there's so much new and cool things with jelly bean....but of recently I've seen some bugs."

After removing the stop words, the string should be:

"love phone, super fast much cool jelly bean....but recently bugs."

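I have mostly got this working, but the problem is that when the string contains adjacent stop words, only the first one is removed, and I get this result: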

"love phone, super fast there's much and cool with jelly bean....but recently seen bugs"  

Here is my stopwordslist.txt file: Stopwords

How can I fix this? Here is what I have done so far:

int k=0,i,j;
ArrayList<String> wordsList = new ArrayList<String>();
String sCurrentLine;
String[] stopwords = new String[2000];
try{
        FileReader fr=new FileReader("F:\\stopwordslist.txt");
        BufferedReader br= new BufferedReader(fr);
        while ((sCurrentLine = br.readLine()) != null){
            stopwords[k]=sCurrentLine;
            k++;
        }
        String s="I love this phone, its super fast and there's so much new and cool things with jelly bean....but of recently I've seen some bugs.";
        StringBuilder builder = new StringBuilder(s);
        String[] words = builder.toString().split("\\s");
        for (String word : words){
            wordsList.add(word);
        }
        for(int ii = 0; ii < wordsList.size(); ii++){
            for(int jj = 0; jj < k; jj++){
                if(stopwords[jj].contains(wordsList.get(ii).toLowerCase())){
                    wordsList.remove(ii);
                    break;
                }
             }
        }
        for (String str : wordsList){
            System.out.print(str+" ");
        }   
    }catch(Exception ex){
        System.out.println(ex);
    }

gee*_*rt3 5

Here is a more elegant solution (IMHO) that uses only a regular expression:

    // instead of the ".....", add all your stopwords, separated by "|"
    // "\\b" is to account for word boundaries, i.e. not replace "his" in "this"
    // the "\\s?" is to suppress optional trailing white space
    Pattern p = Pattern.compile("\\b(I|this|its.....)\\b\\s?");
    Matcher m = p.matcher("I love this phone, its super fast and there's so much new and cool things with jelly bean....but of recently I've seen some bugs.");
    String s = m.replaceAll("");
    System.out.println(s);
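Since the stop words in the question live in a file rather than in a hard-coded pattern, the alternation could also be assembled from a list. The following is only a sketch under some assumptions: the class name `RegexStopWords` and its methods are made up for illustration, the stop words are passed in as a list instead of being read from stopwordslist.txt, and matching is made case-insensitive so that "I" matches a lowercase "i" entry. `Pattern.quote` is used so that any regex metacharacters in a stop word are treated literally.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class RegexStopWords {

    // Builds \b(?:word1|word2|...)\b\s? from the given stop words (hypothetical helper).
    static Pattern buildPattern(List<String> stopWords) {
        String alternation = stopWords.stream()
                .map(Pattern::quote)               // treat each stop word as a literal
                .collect(Collectors.joining("|"));
        return Pattern.compile("\\b(?:" + alternation + ")\\b\\s?",
                Pattern.CASE_INSENSITIVE);
    }

    static String removeStopWords(String text, List<String> stopWords) {
        if (stopWords.isEmpty()) {
            return text;                           // an empty alternation would match everywhere
        }
        return buildPattern(stopWords).matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        List<String> stopWords = Arrays.asList("i", "this", "its", "and");
        // prints: love phone, super fast cool
        System.out.println(removeStopWords(
                "I love this phone, its super fast and cool", stopWords));
    }
}
```

One caveat with this approach: `\b` treats an apostrophe as a word boundary, so a stop word such as "I" would also be stripped out of "I've", leaving "'ve" behind.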


ala*_*inm 3

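The bug occurs because you remove elements from the list while you are iterating over it. Suppose wordsList contains |word0|word1|word2|. If ii is equal to 1 and the if test is true, you call wordsList.remove(1);. After that your list is |word0|word2|. ii is then incremented to 2, which is no longer less than the size of the list, so the loop ends and word2 is never tested.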
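The skipping can be reproduced in isolation. The sketch below is illustrative only: the class and method names are invented, and it uses an exact `Set` lookup in place of the question's `contains` check. `buggyRemove` mirrors the forward loop from the question, while `backwardRemove` shows one possible in-place fix, walking the list from the end so that a removal only shifts elements that have already been visited.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class RemoveWhileIterating {

    // Forward loop as in the question: after remove(i) the next element
    // slides into slot i, and the following i++ steps over it.
    static List<String> buggyRemove(List<String> words, Set<String> stopWords) {
        List<String> list = new ArrayList<>(words);
        for (int i = 0; i < list.size(); i++) {
            if (stopWords.contains(list.get(i))) {
                list.remove(i);
            }
        }
        return list;
    }

    // Backward loop: removing index i never disturbs the indices still to be visited.
    static List<String> backwardRemove(List<String> words, Set<String> stopWords) {
        List<String> list = new ArrayList<>(words);
        for (int i = list.size() - 1; i >= 0; i--) {
            if (stopWords.contains(list.get(i))) {
                list.remove(i);
            }
        }
        return list;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("fast", "and", "so", "cool");
        Set<String> stopWords = new HashSet<>(Arrays.asList("and", "so"));
        System.out.println(buggyRemove(words, stopWords));    // [fast, so, cool]
        System.out.println(backwardRemove(words, stopWords)); // [fast, cool]
    }
}
```

Decrementing the index after each removal in the forward loop would have the same effect; the backward walk just avoids the extra bookkeeping.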

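From there, several solutions are possible. For example, instead of removing the value you could set it to "". Or you could build a separate "result" list.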
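The "result list" idea might look roughly like the sketch below. It is not the asker's code: `StopWordFilter` and `removeStopWords` are hypothetical names, the stop words are supplied as an in-memory `Set` rather than loaded from stopwordslist.txt, and a word is dropped only when it matches a stop word exactly (ignoring case), which differs from the substring-style `contains` test in the question.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class StopWordFilter {

    // Copies every non-stop-word into a new list; the input is never modified,
    // so no index can be skipped.
    static String removeStopWords(String text, Set<String> stopWords) {
        List<String> kept = new ArrayList<>();
        for (String word : text.split("\\s+")) {
            if (!word.isEmpty() && !stopWords.contains(word.toLowerCase())) {
                kept.add(word);
            }
        }
        return String.join(" ", kept);
    }

    public static void main(String[] args) {
        // Assumed sample stop words; lower case, since tokens are lower-cased before lookup.
        Set<String> stopWords = new HashSet<>(
                Arrays.asList("i", "this", "its", "and", "so", "of"));
        // prints: love phone, super fast cool
        System.out.println(removeStopWords(
                "I love this phone, its super fast and so cool", stopWords));
    }
}
```

Because tokens are split on whitespace only, a token such as "phone," keeps its punctuation and is compared as-is, so a stop word followed directly by punctuation would not be removed without extra normalisation.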