she*_*th7 3 python n-gram scikit-learn tfidfvectorizer
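I am using sklearn's TfidfVectorizer to create a document-feature matrix and a list of feature terms.
If an n-gram already exists, I don't want its (n-1)-grams and (n-2)-grams repeated as separate features. For example, take the sentence: The quick brown fox jumps over the fence.
I want to exclude the terms 'fox' and 'brown fox' if 'quick brown fox' exists.
My assumption is that the repeated tokens artificially inflate the feature set and skew the results of downstream tasks (e.g., clustering).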
小智 6
I know this is not an efficient approach, but this is what I did. The pandas Series at the end is only used to subset the array by the selected indices.
import pandas as pd

def removeSubgrams(features):
    # Sort features by n-gram length (number of space-separated tokens)
    features = sorted(features, key=lambda x: len(x.split(" ")))
    to_remove = []
    # Compare each feature against every feature that sorts after it
    for i, subfeature in enumerate(features):
        for longerfeature in features[i + 1:]:
            # Pad with spaces so matches fall on token boundaries
            # (a plain substring test would treat 'ox' as part of 'fox')
            if f" {subfeature} " in f" {longerfeature} ":
                to_remove.append(i)
                # Stop at the first longer feature that contains subfeature
                break
    # The Series index lines up with positions in the sorted list
    features = pd.Series(features)
    # Keep only those features whose position is not in to_remove
    features = features.loc[~features.index.isin(to_remove)]
    return features
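
For larger vocabularies, the pairwise comparison above grows quadratically with the number of features. The following is a minimal sketch of a set-based alternative, not part of the original answer: it records every shorter contiguous token run of each feature and then drops any feature that appears in that set. The name `remove_subgrams` and the sample feature list are made up for illustration, and it assumes features are space-separated n-grams such as those produced by the vectorizer's word analyzer.

```python
def remove_subgrams(features):
    """Drop every n-gram that occurs, on token boundaries, inside a longer n-gram."""
    token_lists = [tuple(f.split()) for f in features]
    covered = set()
    for tokens in token_lists:
        # Record every strictly shorter contiguous run of tokens in this feature
        for n in range(1, len(tokens)):
            for i in range(len(tokens) - n + 1):
                covered.add(tokens[i:i + n])
    # Keep features (in their original order) that no longer feature covers
    return [f for f, tokens in zip(features, token_lists) if tokens not in covered]


# Hypothetical feature list, loosely based on the example sentence in the question
features = ["fox", "brown fox", "quick brown fox", "jumps", "over the"]
print(remove_subgrams(features))
# → ['quick brown fox', 'jumps', 'over the']
```

One possible way to use the result, which I have not tested here, is to pass the filtered list as the `vocabulary` argument of a second TfidfVectorizer with the same `ngram_range`, so that only the surviving n-grams become columns. Note that this also drops 'fox' in documents where it appears outside 'quick brown fox'.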