sklearn TfidfVectorizer - remove (n-2)- and (n-1)-grams if the n-gram exists

Question

she*_*th7 3 python n-gram scikit-learn tfidfvectorizer

I am using sklearn's TfidfVectorizer to create a document-feature matrix and a list of feature terms.

If an n-gram is already present, I don't want its (n-1)- and (n-2)-grams repeated as separate features. For example, take the sentence: The quick brown fox jumps over the fence.

I do not want to include the terms 'fox' and 'brown fox' if 'quick brown fox' exists.

My assumption is that the repeated tokens artificially inflate the feature set and skew the results of other tasks, such as clustering.

Answer 1

小智 6

I know this is not an efficient approach, but this is what I did. The pandas Series at the end is only used to subset the array by the selected indices.

import pandas as pd

def removeSubgrams(features):
  # Sort features by n-gram length (number of space-separated tokens)
  features = sorted(features, key=lambda x: len(x.split(" ")))

  to_remove = []

  # For each feature, scan the features that come after it in sorted order
  for i, subfeature in enumerate(features):
    for longerfeature in features[i+1:]:
      # Pad with spaces so only whole tokens match:
      # 'fox' matches 'brown fox' but not 'foxes jump'
      if (" " + subfeature + " ") in (" " + longerfeature + " "):
        to_remove.append(i)
        # One containing n-gram is enough, so stop scanning
        break
  features = pd.Series(features)
  # keep only those features that are not in to_remove
  features = features.loc[~features.index.isin(to_remove)]
  return features

Archived:	6 years, 3 months ago
Views:	1323
Last recorded:	2 years, 11 months ago