Tags: python, nltk, difflib, scikit-learn
I have a list of sentences, like:
errList = [ 'Ragu ate lunch but didnt have Water for drinks',
'Rams ate lunch but didnt have Gatorade for drinks',
'Saya ate lunch but didnt have :water for drinks',
'Raghu ate lunch but didnt have water for drinks',
'Hanu ate lunch but didnt have -water for drinks',
'Wayu ate lunch but didnt have water for drinks',
'Viru ate lunch but didnt have .water 4or drinks',
'kk ate lunch & icecream but did have Water for drinks',
'M ate lunch &and icecream but did have Gatorade for drinks',
'Parker ate lunch icecream but didnt have :water for drinks',
'Sassy ate lunch and icecream but didnt have water for drinks',
'John ate lunch and icecream but didnt have -water for drinks',
'Pokey ate lunch and icecream but didnt have Water for drinks',
'Laila ate lunch and icecream but did have water 4or drinks',
]
I want to find the counts of the longest phrases/parts of the sentences (a phrase must be more than 2 words) across the elements of the list. For the example above, the output would be close to this (longest phrases as keys, counts as values):
{ 'ate lunch but didnt have': 7,
'water for drinks': 7,
'ate lunch and icecream': 4,
'didnt have water': 3,
'didnt have Water': 2 # case sensitive
}
This doesn't seem possible with the re module, since the problem is closer to sequence matching; could it perhaps be done with nltk or scikit-learn? I know some NLP and scikit, but not enough to solve this. If I work it out, I'll post the answer here.
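Since difflib is one of the tags: a starting point for the pairwise version of this is difflib.SequenceMatcher over word lists. A minimal sketch (longest_common_phrase is a hypothetical helper name, not a solution to the full counting problem):
import difflib

def longest_common_phrase(a, b):
    # Longest run of words shared by two sentences.
    wa, wb = a.split(), b.split()
    m = difflib.SequenceMatcher(None, wa, wb).find_longest_match(0, len(wa), 0, len(wb))
    return ' '.join(wa[m.a:m.a + m.size])

longest_common_phrase(errList[0], errList[1])
# 'ate lunch but didnt have'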
scikit-learn plus a little numpy foo makes this not too painful. Note, though, that I'm only using the default preprocessing here; if you care about the punctuation in your dataset, you'll need to adjust it (see the sketch after the output below).
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
# Find all the phrases >2 up to the max length
cv = CountVectorizer(ngram_range=(3, max([len(x.split(' ')) for x in errList])))
# Get the counts of the phrases
err_counts = cv.fit_transform(errList)
# Get the sum of each of the phrases
err_counts = err_counts.sum(axis=0)
# Mess about with the types, sparsity is annoying
err_counts = np.squeeze(np.asarray(err_counts))
# Retrieve the actual phrases that we're working with
# (on older scikit-learn versions this was cv.get_feature_names())
feat_names = np.array(cv.get_feature_names_out())
# We don't have to sort here, but it's nice to if you want to print anything
err_counts_sorted = err_counts.argsort()[::-1]
feat_names = feat_names[err_counts_sorted]
err_counts = err_counts[err_counts_sorted]
# This is the dictionary that you were after
err_dict = dict(zip(feat_names, err_counts))
Here is the output for the top few:
11 but didnt have
10 have water for drinks
10 have water for
10 water for drinks
10 but didnt have water
10 didnt have water
9 but didnt have water for drinks
9 but didnt have water for
9 didnt have water for drinks
9 didnt have water for
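As a minimal sketch of the preprocessing adjustment mentioned above (an assumption about what you'd want, not part of the original answer): a plain whitespace tokenizer keeps punctuation attached to tokens, and lowercase=False keeps case, matching the case-sensitive counts in the question:
cv = CountVectorizer(
    tokenizer=str.split,  # keep tokens like ':water' and '-water' intact
    lowercase=False,      # preserve case, so 'Water' != 'water'
    ngram_range=(3, max(len(x.split(' ')) for x in errList)),
)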
If you don't want to bother with external libraries, you can get this done with just the stdlib (although it may be slower than some of the alternatives):
import collections
import itertools
def gen_ngrams(sentence):
    words = sentence.split()  # or re.findall(r'\b\w+\b', sentence), or whatever
    n_words = len(words)
    for i in range(n_words - 2):
        # n_words + 1 so that phrases ending at the last word are included
        for j in range(i + 3, n_words + 1):
            yield ' '.join(words[i:j])  # assumes spaces are already normalized

def count_ngrams(sentences):
    return collections.Counter(
        itertools.chain.from_iterable(
            gen_ngrams(sentence) for sentence in sentences
        )
    )

counts = count_ngrams(errList)
dict(counts.most_common(10))
Which gets you:
{'but didnt have': 11,
'ate lunch but': 7,
'ate lunch but didnt': 7,
'ate lunch but didnt have': 7,
'lunch but didnt': 7,
'lunch but didnt have': 7,
'icecream but didnt': 4,
'icecream but didnt have': 4,
'ate lunch and': 4,
'ate lunch and icecream': 4}
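The question only wanted the longest phrases as keys. A rough post-filter sketch (my own addition, assuming "longest" means not contained in any other phrase with at least the same count; note the substring test is naive about word boundaries and the loop is quadratic in the number of phrases):
def keep_maximal(counts):
    # Keep a phrase only if no longer phrase with at least
    # the same count contains it.
    return {
        phrase: n
        for phrase, n in counts.items()
        if not any(
            phrase != other and phrase in other and m >= n
            for other, m in counts.items()
        )
    }

keep_maximal(counts)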