sklearn TfidfVectorizer - remove (n-2)-grams and (n-1)-grams if the n-gram exists

she*_*th7 3 python n-gram scikit-learn tfidfvectorizer

I am using sklearn's TfidfVectorizer to build a document-feature matrix and a list of feature terms.

If an n-gram is already present, I don't want its (n-1)-grams and (n-2)-grams repeated as separate features. For example, take the sentence: The quick brown fox jumps over the fence

I do not want to include the terms 'fox' and 'brown fox' if 'quick brown fox' exists.

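My hypothesis is that the duplicated tokens artificially inflate the feature set and skew the results of downstream tasks such as clustering.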
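To make the overlap concrete, here is a small pure-Python sketch that lists the word 1- to 3-grams of the example sentence. It only approximates what a vectorizer with an n-gram range of 1 to 3 might produce (lowercasing and whitespace splitting are assumptions, and `word_ngrams` is a hypothetical helper, not part of sklearn).

```python
def word_ngrams(text, n_min=1, n_max=3):
    """Return all lowercase word n-grams of text for n_min <= n <= n_max."""
    # Assumption: plain whitespace tokenization, no stop-word removal.
    tokens = text.lower().split()
    return [
        " ".join(tokens[i:i + n])
        for n in range(n_min, n_max + 1)
        for i in range(len(tokens) - n + 1)
    ]

sentence = "The quick brown fox jumps over the fence"
grams = word_ngrams(sentence)

# 'fox' and 'brown fox' both appear alongside 'quick brown fox'.
overlapping = [g for g in ("fox", "brown fox", "quick brown fox") if g in grams]
print(overlapping)
```

Every unigram and bigram in this list is contained in at least one trigram, which is the redundancy I would like to get rid of.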

小智 6

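I know this is not an efficient approach, but this is what I did. The pandas Series at the end is only there to subset the array by the selected indices.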

import pandas as pd

def removeSubgrams(features):
  # Sort features by n-gram length (number of words), shortest first
  features = sorted(features, key=lambda x: len(x.split(" ")))

  to_remove = []

  # Compare each feature against every feature that sorts after it
  for i, subfeature in enumerate(features):
    for longerfeature in features[i+1:]:
      # Pad with spaces so only whole words match ('fox' must not match 'foxes')
      if f" {subfeature} " in f" {longerfeature} ":
        to_remove.append(i)
        # one containing n-gram is enough, so stop searching
        break
  features = pd.Series(features)
  # keep only those features whose index is not in to_remove
  features = features.loc[~features.index.isin(to_remove)]
  return features
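Since the loop above compares every pair of features, a possibly faster variant is sketched below. It is not the code I actually used: it collects every shorter contiguous word sequence of each feature in a set, then keeps only the features that never show up in that set. The name `remove_subgrams_fast` is hypothetical, and it assumes the feature names are words joined by single spaces.

```python
def remove_subgrams_fast(features):
    """Keep features that are not a contiguous word sub-sequence of a longer feature."""
    covered = set()
    for feature in features:
        tokens = feature.split()
        n = len(tokens)
        # Collect every contiguous sub-sequence that is strictly shorter than the feature.
        for size in range(1, n):
            for start in range(n - size + 1):
                covered.add(" ".join(tokens[start:start + size]))
    # Preserve the input order of the surviving features.
    return [f for f in features if f not in covered]

kept = remove_subgrams_fast(["fox", "brown fox", "quick brown fox", "foxes", "jumps"])
print(kept)
```

This returns a plain list of feature names rather than a pandas Series. The positions of the surviving names in the original feature list could then be used to select the matching columns of the document-feature matrix.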