python-3.x gensim text-classification word2vec
I want to perform text classification using word2vec, and I have obtained the word vectors.
import numpy as np
from gensim.models import Word2Vec

# Split the raw text into sentences, then tokenize each sentence on whitespace
ls = []
sentences = lines.split(".")
for i in sentences:
    ls.append(i.split())

# Train a small word2vec model (4-dimensional vectors)
model = Word2Vec(ls, min_count=1, size=4)
words = list(model.wv.vocab)
print(words)

# Collect the vector of every word in the vocabulary
vectors = []
for word in words:
    vectors.append(model.wv[word].tolist())
data = np.array(vectors)
data
Output:
array([[ 0.00933912, 0.07960335, -0.04559333, 0.10600036],
[ 0.10576613, 0.07267512, -0.10718666, -0.00804013],
[ 0.09459028, -0.09901826, -0.07074171, -0.12022413],
[-0.09893986, 0.01500741, -0.04796079, -0.04447284],
[ 0.04403428, -0.07966098, -0.06460238, -0.07369237],
[ 0.09352681, -0.03864434, -0.01743148, 0.11251986],.....])
How do I now perform the classification (product vs. non-product)?
You already have the array of word vectors in `model.wv.syn0`. If you print it, you can see an array containing the corresponding vector for each word.
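For instance, here is a minimal sketch (assuming a gensim 3.x model like the one trained above) showing that each row of `model.wv.syn0` lines up with the word at the same position in `model.wv.index2word`:

# Each row i of syn0 is the vector for model.wv.index2word[i]
for i, word in enumerate(model.wv.index2word[:5]):
    print(word, model.wv.syn0[i])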
You can see an example using Python 3 here:
import pandas as pd
import os
import gensim
import nltk as nl
from sklearn.linear_model import LogisticRegression

# Reading a csv file with text data
dbFilepandas = pd.read_csv('machine learning\\Python\\dbSubset.csv').apply(lambda x: x.astype(str).str.lower())

train = []
# getting only the first 4 columns of the file
for sentences in dbFilepandas[dbFilepandas.columns[0:4]].values:
    train.extend(sentences)

# Create an array of tokens using nltk
tokens = [nl.word_tokenize(sentences) for sentences in train]
Now it is time to use the vector model; in this example we will compute a LogisticRegression.
# Use either method 1 or method 2 below, not both.

# method 1 - pass the tokens to the Word2Vec constructor directly, so you
# don't need to train again with the train method
model = gensim.models.Word2Vec(tokens, size=300, min_count=1, workers=4)

# method 2 - create a Word2Vec object, build the vocabulary, then train
model = gensim.models.Word2Vec(size=300, min_count=1, workers=4)
# building vocabulary for training
model.build_vocab(tokens)
print("\n Training the word2vec model...\n")
# reducing the epochs will decrease the computation time
model.train(tokens, total_examples=len(tokens), epochs=4000)
# You can save your model if you want....

# The two datasets must be the same size
max_dataset_size = len(model.wv.syn0)

Y_dataset = []
# get the last character of each line. In this case it is the department
# number; this will be the 0 or 1, or another kind of label. (To use words
# as labels you need to extract them differently; this way works for numbers.)
with open("dbSubset.csv", "r") as f:
    for line in f:
        lastchar = line.strip()[-1]
        if lastchar.isdigit():
            Y_dataset.append(int(lastchar))
        else:
            # fall back to a sentinel label when the line has no digit
            Y_dataset.append(40)

clf = LogisticRegression(random_state=0, solver='lbfgs',
                         multi_class='multinomial').fit(model.wv.syn0, Y_dataset[:max_dataset_size])
# Prediction of the first 15 samples of all features
predict = clf.predict(model.wv.syn0[:15, :])
# Calculating the score of the predictions
score = clf.score(model.wv.syn0, Y_dataset[:max_dataset_size])
print("\nPrediction word2vec : \n", predict)
print("Score word2vec : \n", score)
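As a side note on the "save your model" comment above, a minimal sketch using gensim's standard save/load API (the filename `word2vec.model` is just a placeholder):

# Persist the trained model to disk and load it back later
model.save("word2vec.model")
model = gensim.models.Word2Vec.load("word2vec.model")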
You can also compute the similarity between words that belong to the vocabulary of the model you created:
print("\n\nSimilarity value : ",model.wv.similarity('women','men'))
You can find more functions to use here.
Your question is rather broad, but I will try to give you a first approach to classifying text documents.
First of all, I would decide how to represent each document as one vector. So you need a method that takes a list of (word) vectors and returns one single vector. You want to avoid that the length of the document influences what this representation expresses. You could, for example, choose the mean.
def document_vector(array_of_word_vectors):
    return array_of_word_vectors.mean(axis=0)
Here `array_of_word_vectors` would be, for example, `data` in your code.
Now you can either play around with distances (cosine distance, for example, would be a good first choice) and see how close certain documents are to each other, or, which is probably the approach that brings faster results, you can use the document vectors to build a training set for a classification algorithm of your choice from scikit-learn, for example logistic regression.
The document vectors will become your matrix X, and your vector y will be an array of 1s and 0s, depending on the binary category you want to classify the documents into.
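A minimal sketch of that last step, assuming a hypothetical list `documents` where each entry is an array of word vectors (like `data` above) and a matching hypothetical list `labels` of 0/1 values:

import numpy as np
from sklearn.linear_model import LogisticRegression

# documents: hypothetical list, one array of word vectors per document
# labels: hypothetical list of 0/1 class labels, one per document
X = np.array([document_vector(doc) for doc in documents])
y = np.array(labels)

clf = LogisticRegression().fit(X, y)
print(clf.predict(X[:5]))  # predicted classes for the first five documents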