Filtering text data in Python

Ale*_*lex 1 python

I'm having trouble working out what I'm doing wrong here. I have the code below (it's fairly simple).

def compileWordList(textList, wordDict):
    '''Function to extract words from text lines exc. stops,
        and add associated line nums'''
    i = 0;
    for row in textList:
        i = i + 1
        words = re.split('\W+', row)
        for wordPart in words:
            word = repr(wordPart)
            word = word.lower()
            if not any(word in s for s in stopsList):
                if word not in wordDict:
                    x = wordLineNumsContainer()
                    x.addLineNum(i)
                    wordDict[word] = x
                elif word in wordDict:
                    lineNumValues = wordDict[word]
                    lineNumValues.addLineNum(i)
                    wordDict[word] = lineNumValues
            elif any(word in s for s in stopsList):
                print(word)

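The code takes strings (sentences) from a list. It then uses re.split() to split each string into whole words, returning a list of strings (the words).

I then force each string to lowercase. Next I want to check each word against a list of stop words I have (words too common in English to bother with). The part that checks whether word is in stopsList never seems to work, because the stop words end up in my wordDict every time. I also added the print(word) statement at the bottom to check whether it was catching them, but nothing is ever printed :(

There are hundreds of stop words used in the strings being passed in.

Can someone please enlighten me here? Why are the strings never filtered for being stop words?

Many thanks, Alex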

Joh*_*iss 7

How about this?

from collections import defaultdict
import re

# Set of stop words, for fast exact-match lookups
stop_words = set(['a', 'is', 'and', 'the', 'i'])
text = [ 'This is the first line in my text'
       , 'and this one is the second line in my text'
       , 'I like texts with three lines, so I added that one'
       ]
# Maps each word to the list of line numbers it appears on
word_line_dict = defaultdict(list)

for line_no, line in enumerate(text, 1):
    # Lowercase and deduplicate the words on this line
    words = set(map(str.lower, re.split(r'\W+', line)))
    words_ok = words.difference(stop_words)
    for wok in words_ok:
        word_line_dict[wok].append(line_no)

print(word_line_dict)

Many thanks to Gnibbler for a better way to write the for loop and a more Pythonic way to handle the first insertion into the dict.

This prints (apart from the formatting of the dictionary):

{ 'added': [3]
, 'like': [3]
, 'that': [3]
, 'this': [1, 2]
, 'text': [1, 2]
, 'lines': [3]
, 'three': [3]
, 'one': [2, 3]
, 'texts': [3]
, 'second': [2]
, 'so': [3]
, 'in': [1, 2]
, 'line': [1, 2]
, 'my': [1, 2]
, 'with': [3]
, 'first': [1]
}
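
On the first-insertion point: a minimal sketch of the difference between checking for the key by hand and letting defaultdict create the empty list. The pairs data and the names explicit and auto are made up for illustration.

```python
from collections import defaultdict

# Hypothetical (word, line number) pairs, for illustration only
pairs = [('line', 1), ('text', 1), ('line', 2)]

# Explicit handling: create the list by hand the first time a word is seen
explicit = {}
for word, line_no in pairs:
    if word not in explicit:
        explicit[word] = []
    explicit[word].append(line_no)

# defaultdict(list) creates the empty list automatically on first access
auto = defaultdict(list)
for word, line_no in pairs:
    auto[word].append(line_no)

print(dict(auto) == explicit)
```

Both loops build the same mapping; the defaultdict version just drops the membership test.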
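
As for why the original check never fires, the likely culprit is the repr() call: it wraps each word in quote characters, so the stop-word test is comparing "'the'" against "the". The sketch below assumes stopsList holds plain lowercase words; stops_list and the sample row are stand-ins, not the asker's data.

```python
import re

# Hypothetical stand-in for the question's stopsList
stops_list = ['a', 'is', 'and', 'the', 'i']

row = 'This is the first line'
raw_words = re.split(r'\W+', row)

# repr() adds quote characters, so "'the'" is never a substring of "the"
with_repr = [repr(w).lower() for w in raw_words]
caught_with_repr = [w for w in with_repr if any(w in s for s in stops_list)]

# Without repr(), using an exact membership test against a set
stop_set = set(stops_list)
plain = [w.lower() for w in raw_words if w]
kept = [w for w in plain if w not in stop_set]

print(with_repr)
print(caught_with_repr)
print(kept)
```

Note also that `word in s` on two strings is a substring test, not an equality test, so even without repr() a word like "he" would be treated as a stop word because it occurs inside "the". Testing membership in a set avoids that.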