Gre*_*dot 5 python machine-learning
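I want to find various blacklisted terms in a corpus of text paragraphs. Each term is roughly 1-5 words long and contains certain keywords that I do not want in my document corpus. If a term that matches or resembles one of them is identified in the corpus, I want to remove it from the corpus.
Removal aside, I am struggling to accurately identify these terms in my corpus. I am using scikit-learn and have tried two separate approaches:
A MultinomialNB classification approach using tf-idf vector features, with a mix of blacklisted terms and clean terms as training data.
A OneClassSVM approach in which only the blacklisted keywords are used as training data, and any text passed in that does not look similar to the blacklisted terms is treated as an outlier.
Here is the code for my OneClassSVM approach: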
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import KFold
from sklearn.pipeline import Pipeline
from sklearn.svm import OneClassSVM

df = pd.read_csv("keyword_training_blacklist.csv")
keywords_list = df['Keyword']

pipeline = Pipeline([
    # strings to character n-gram counts
    ('vect', CountVectorizer(analyzer='char_wb', max_df=0.75, min_df=1, ngram_range=(1, 5))),
    # counts to L2-normalized term frequencies (IDF weighting is disabled)
    ('tfidf', TfidfTransformer(use_idf=False, norm='l2')),
    # one-class SVM trained on the blacklisted terms only
    ('clf', OneClassSVM(nu=0.1, kernel="rbf", gamma=0.1)),
])

kf = KFold(n_splits=8)
for train_index, test_index in kf.split(keywords_list):
    # make training and testing datasets
    X_train, X_test = keywords_list.iloc[train_index], keywords_list.iloc[test_index]
    pipeline.fit(X_train)  # train on blacklisted terms only (no labels)
    predicted = pipeline.predict(X_test)
    # fraction of held-out blacklisted terms recognized as inliers
    print(predicted[predicted == 1].size / predicted.size)

csv_df = pd.read_csv("corpus.csv")
testCorpus = csv_df['Terms']
testCorpus = testCorpus.drop_duplicates()

for s in testCorpus:
    if pipeline.predict([s])[0] == 1:
        print(s)
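In practice, I get many false positives when I pass my corpus through the algorithm. My training data consists of about 3,000 blacklisted terms. Does my training set need to be larger, or am I missing something obvious?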
Try using difflib to identify the closest matches in the corpus for each blacklisted term.
import difflib
from nltk.util import ngrams

# `corpus` (one string) and `terms` (list of blacklisted terms) are assumed to be defined already
words = corpus.split(' ')  # split corpus into words on spaces (can be improved)
words_ngrams = []  # n-grams of 1 to 5 words
for n in range(1, 6):
    # ngrams() yields tuples of words, so join each tuple back into a string
    words_ngrams.extend(' '.join(gram) for gram in ngrams(words, n))

to_delete = []  # will contain tuples (index, length) of matched terms to delete from corpus
sim_rate = 0.8  # similarity cutoff
max_matches = 4  # maximum number of matches for each term
for term in terms:
    matches = difflib.get_close_matches(term, words_ngrams, n=max_matches, cutoff=sim_rate)
    for match in matches:
        # note: index() returns only the first occurrence of the match
        to_delete.append((corpus.index(match), len(match)))
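The snippet above only collects `(index, length)` tuples in `to_delete`; it does not actually remove anything. One possible way to apply them is sketched below. The helper name `delete_spans` and the sample corpus are made up for illustration, and the whitespace cleanup at the end is an assumption that may not suit every corpus (it also collapses newlines).

```python
def delete_spans(corpus, spans):
    """Remove (index, length) spans from corpus, merging overlaps first."""
    merged = []
    for start, length in sorted(spans):
        end = start + length
        if merged and start <= merged[-1][1]:
            # Overlapping or adjacent span: extend the previous one.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])

    # Keep only the text that lies between the merged spans.
    pieces, cursor = [], 0
    for start, end in merged:
        pieces.append(corpus[cursor:start])
        cursor = end
    pieces.append(corpus[cursor:])

    # Collapse the runs of whitespace left behind by the removals.
    return ' '.join(''.join(pieces).split())


# Hypothetical data: two overlapping matches found in a tiny corpus.
corpus = "please buy cheap pills online today"
to_delete = [(corpus.index(m), len(m)) for m in ("cheap pills", "pills online")]
cleaned = delete_spans(corpus, to_delete)
print(cleaned)
```

Merging the spans before cutting matters because matches for different terms can overlap, and deleting them one at a time would shift the indices of the later ones.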
difflib.SequenceMatcher can also be used if you want a similarity score between a term and an n-gram.
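As a minimal sketch of that idea, the following scores a term against a few candidate n-grams with `SequenceMatcher.ratio()`. The helper names `similarity` and `best_match`, the lowercasing step, and the sample strings are assumptions for illustration only.

```python
import difflib


def similarity(term, candidate):
    """Return a ratio in [0, 1]; 1.0 means the two strings are identical."""
    return difflib.SequenceMatcher(None, term.lower(), candidate.lower()).ratio()


def best_match(term, candidates):
    """Return (score, candidate) for the candidate most similar to term."""
    return max((similarity(term, c), c) for c in candidates)


# Hypothetical n-grams taken from a corpus.
candidates = ["cheap pills", "cheap bills", "weather report"]
score, match = best_match("cheap pills", candidates)
print(match, score)

for candidate in candidates:
    print(candidate, round(similarity("cheap pills", candidate), 3))
```

Having the raw score lets you tune the cutoff (0.8 in the code above) by inspecting borderline cases instead of only seeing which n-grams passed.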