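I am trying to build a matrix whose first row holds the parts of speech and whose first column holds the sentences. Each value in the matrix should be the number of times that POS tag occurs in the sentence.
So I create the POS tags this way: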
import nltk
import pandas as pd

data = pd.read_csv(open('myfile.csv'), sep=';')
target = data["label"]
del data["label"]
data.sentence = data.sentence.str.lower()  # lowercase every sentence in the data frame
for line in data.sentence:
    # Tokenize the sentence, then tag each token with its part of speech
    Line_new = nltk.pos_tag(nltk.word_tokenize(line))
    print(Line_new)
The output is:
[('together', 'RB'), ('with', 'IN'), ('the', 'DT'), ('6th', 'CD'), ('battalion', 'NN'), ('of', 'IN'), ('the', 'DT')]
How can I create the matrix I described above from output like this?
Update: the desired output is
               NN  VB  IN  VBZ  DT
I was there     1   1   1    0   0
He came there   0   0   1    1   1
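One way such a matrix could be assembled is sketched below. This is only an illustration, not a definitive solution: the helper name `pos_count_matrix` is made up, and it assumes each sentence has already been tagged into a list of (token, tag) pairs like the output shown above.

```python
from collections import Counter


def pos_count_matrix(tagged_sentences):
    """Count POS tags per sentence.

    tagged_sentences: list of lists of (token, tag) pairs, one list per
    sentence, in the shape returned by nltk.pos_tag.
    Returns (columns, rows): the sorted tag names and one count row per sentence.
    """
    # One Counter per sentence, counting only the tag of each pair
    counts = [Counter(tag for _, tag in pairs) for pairs in tagged_sentences]
    # Column order: every tag seen in any sentence, sorted alphabetically
    columns = sorted(set().union(*counts))
    # A Counter returns 0 for tags that a sentence does not contain
    rows = [[c[tag] for tag in columns] for c in counts]
    return columns, rows


# The tagged sentence from the output above
tagged = [
    [('together', 'RB'), ('with', 'IN'), ('the', 'DT'), ('6th', 'CD'),
     ('battalion', 'NN'), ('of', 'IN'), ('the', 'DT')],
]
columns, rows = pos_count_matrix(tagged)
print(columns)  # ['CD', 'DT', 'IN', 'NN', 'RB']
print(rows)     # [[1, 2, 2, 1, 1]]
```

In the loop from the question, each `nltk.pos_tag(...)` result could be appended to a list and passed to this helper. The result could then be wrapped as `pd.DataFrame(rows, index=data.sentence, columns=columns)` to get the sentences as row labels.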
myfile.csv:
"A child who is exclusively or predominantly oral (using …

I am using scikit-learn for text processing, but my CountVectorizer is not giving me the output I expect.
My CSV file looks like this:
"Text";"label"
"Here is sentence 1";"label1"
"I am sentence two";"label2"
and so on.
So I want to start with Bag of Words to understand how SVM works in Python.
import pandas as pd
from sklearn import svm
from sklearn.feature_extraction.text import CountVectorizer
data = pd.read_csv(open('myfile.csv'),sep=';')
target = data["label"]
del data["label"]
# Creating Bag of Words
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(data)
X_train_counts.shape
count_vect.vocabulary_.get(u'algorithm')
And when I run
print(X_train_counts.shape)
I see the output (1, 1), even though I have 1048 rows of sentences. Then I look at the output of
count_vect.vocabulary_.get(u'algorithm')
which is None.
Can you tell me what I am doing wrong? I am following this tutorial.
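A likely cause of the (1, 1) shape, sketched under the assumption that the text column is named `Text` as in the sample CSV: `fit_transform` expects an iterable of strings, and iterating over a DataFrame yields its column names rather than its rows. After the label column is deleted, the vectorizer therefore sees a single "document", the string `Text`. The small in-memory frame below is a stand-in for the real file.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Stand-in for myfile.csv: same column names, two sample rows from the question
data = pd.DataFrame({
    "Text": ["Here is sentence 1", "I am sentence two"],
    "label": ["label1", "label2"],
})
target = data.pop("label")

# Iterating a DataFrame yields column names, not row values
print(list(data))  # ['Text']

# Passing the whole frame: one "document" containing the single token "text"
X_wrong = CountVectorizer().fit_transform(data)
print(X_wrong.shape)  # (1, 1)

# Passing the column of sentences: one row per sentence
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(data["Text"])
print(X_train_counts.shape)
print(count_vect.vocabulary_.get("sentence"))
```

With the column passed in, `vocabulary_.get(...)` returns a column index for any word that occurs in the sentences, and still returns None for words that never appear.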