Ove*_*ass 6 python sparse-matrix dataframe pandas scikit-learn
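Question: What is the best way to convert the sparse matrices produced by sklearn's CountVectorizer and TfidfTransformer into Pandas DataFrame columns, with a separate row for each bigram and its corresponding frequency and tf-idf score?
Pipeline: pull text data from a SQL DB, split the text into bigrams, compute the frequency and the tf-idf score of each bigram per document, and load the results back into the SQL DB.
Current state:
Two columns of data (number, text) are brought in. The text is cleaned to produce a third column, cleanText: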
number text cleanText
0 123 The farmer plants grain farmer plants grain
1 234 The farmer and his son go fishing farmer son go fishing
2 345 The fisher catches tuna fisher catches tuna
This DataFrame is then fed into sklearn's feature extraction:
cv = CountVectorizer(token_pattern=r"(?u)\b\w+\b", stop_words=None, ngram_range=(2,2), analyzer='word')
dt_mat = cv.fit_transform(data.cleanText)
tfidf_transformer = TfidfTransformer()
tfidf_mat = tfidf_transformer.fit_transform(dt_mat)
The matrices are then converted to arrays and fed back into the original DataFrame:
data['frequency'] = list(dt_mat.toarray())
data['tfidf_score']=list(tfidf_mat.toarray())
Output:
number text cleanText \
0 123 The farmer plants grain farmer plants grain
1 234 The farmer and his son go fishing farmer son go fishing
2 345 The fisher catches tuna fisher catches tuna
frequency tfidf_score
0 [0, 1, 0, 0, 0, 1, 0] [0.0, 0.707106781187, 0.0, 0.0, 0.0, 0.7071067...
1 [0, 0, 1, 0, 1, 0, 1] [0.0, 0.0, 0.57735026919, 0.0, 0.57735026919, ...
2 [1, 0, 0, 1, 0, 0, 0] [0.707106781187, 0.0, 0.0, 0.707106781187, 0.0...
Problem:
frequency and tfidf_score are not broken out into a separate row for each bigram.
Desired output:
number bigram frequency tfidf_score
0 123 farmer plants 1 0.70
0 123 plants grain 1 0.56
1 234 farmer son 1 0.72
1 234 son go 1 0.63
1 234 go fishing 1 0.34
2 345 fisher catches 1 0.43
2 345 catches tuna 1 0.43
I managed to get one of the numeric columns spread across separate rows of a DataFrame with the following code:
data.reset_index(inplace=True)
rows = []
_ = data.apply(lambda row: [rows.append([row['number'], nn])
for nn in row.tfidf_score], axis=1)
df_new = pd.DataFrame(rows, columns=['number', 'tfidf_score'])
Output:
number tfidf_score
0 123 0.000000
1 123 0.707107
2 123 0.000000
3 123 0.000000
4 123 0.000000
5 123 0.707107
6 123 0.000000
7 234 0.000000
8 234 0.000000
9 234 0.577350
10 234 0.000000
11 234 0.577350
12 234 0.000000
13 234 0.577350
14 345 0.707107
15 345 0.000000
16 345 0.000000
17 345 0.707107
18 345 0.000000
19 345 0.000000
20 345 0.000000
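However, I am not sure how to do this for both numeric columns, and it does not bring in the bigrams (the feature names) themselves. This approach also requires an array (which is why I converted the sparse matrix to an array in the first place). I would like to avoid that if possible, both for performance reasons and because I then have to strip out the meaningless zero rows.
Any insight is greatly appreciated! Thank you for taking the time to read this, and apologies for the lengthy question. Please let me know if there is anything I can do to improve the question or clarify my process.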
The bigram names can be retrieved with CountVectorizer's get_feature_names(). From there, it is just a series of melt and merge operations:
print(data)
number text cleanText
0 123 The farmer plants grain farmer plants grain
1 234 The farmer and his son go fishing farmer son go fishing
2 345 The fisher catches tuna fisher catches tuna
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
cv = CountVectorizer(token_pattern=r"(?u)\b\w+\b", stop_words=None, ngram_range=(2,2), analyzer='word')
dt_mat = cv.fit_transform(data.cleanText)
tfidf_transformer = TfidfTransformer()
tfidf_mat = tfidf_transformer.fit_transform(dt_mat)
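As an aside on where the tf-idf values in this example come from: with its default settings (smooth_idf=True, norm='l2'), TfidfTransformer weights each count by idf = ln((1 + n) / (1 + df)) + 1 and then scales each document's row to unit length. The sketch below redoes that arithmetic by hand for the three example documents; the helper names smoothed_idf and l2_normalize are illustrative only and are not part of scikit-learn.

```python
import math

n_docs = 3  # number of documents in the example


def smoothed_idf(doc_freq, n_docs):
    # Default TfidfTransformer weighting (smooth_idf=True)
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1


def l2_normalize(weights):
    # Default TfidfTransformer row normalization (norm='l2')
    norm = math.sqrt(sum(w * w for w in weights))
    return [w / norm for w in weights]


# Every bigram in the example occurs once, in exactly one document (df = 1).
# Document 0 has two bigrams, document 1 has three.
doc0 = l2_normalize([1 * smoothed_idf(1, n_docs)] * 2)
doc1 = l2_normalize([1 * smoothed_idf(1, n_docs)] * 3)

print([round(w, 6) for w in doc0])  # [0.707107, 0.707107]
print([round(w, 6) for w in doc1])  # [0.57735, 0.57735, 0.57735]
```

Because all bigrams in a given document carry the same weight here, the normalized scores reduce to 1/sqrt(2) and 1/sqrt(3), which is why only two distinct values show up in the output.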
In this example, the CountVectorizer feature names are the bigrams:
print(cv.get_feature_names())
[u'catches tuna',
u'farmer plants',
u'farmer son',
u'fisher catches',
u'go fishing',
u'plants grain',
u'son go']
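A note for newer environments: scikit-learn 1.0 deprecated get_feature_names() in favor of get_feature_names_out(), and the old method was removed in 1.2. The helper below is a small sketch, under that assumption, that works with either version; the name feature_names is hypothetical and not part of the library.

```python
from sklearn.feature_extraction.text import CountVectorizer


def feature_names(vectorizer):
    """Return the fitted feature names as a plain list of str."""
    if hasattr(vectorizer, "get_feature_names_out"):
        # scikit-learn >= 1.0
        return [str(name) for name in vectorizer.get_feature_names_out()]
    # older scikit-learn releases
    return [str(name) for name in vectorizer.get_feature_names()]


texts = ["farmer plants grain", "farmer son go fishing", "fisher catches tuna"]
cv = CountVectorizer(token_pattern=r"(?u)\b\w+\b", ngram_range=(2, 2))
cv.fit(texts)

names = feature_names(cv)
print(names)
```

The returned list is ordered by column index, so it lines up with the columns of the matrix returned by fit_transform().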
CountVectorizer.fit_transform() returns a sparse matrix. We can convert it to a dense representation, wrap it in a DataFrame, and then attach the feature names as column labels:
bigrams = pd.DataFrame(dt_mat.todense(), index=data.index, columns=cv.get_feature_names())
bigrams['number'] = data.number
print(bigrams)
catches tuna farmer plants farmer son fisher catches go fishing \
0 0 1 0 0 0
1 0 0 1 0 1
2 1 0 0 1 0
plants grain son go number
0 1 0 123
1 0 1 234
2 0 0 345
To go from wide to long format, use melt().
Then restrict the results to bigrams that actually occur (query() is useful here):
bigrams_long = (pd.melt(bigrams.reset_index(),
id_vars=['index','number'],
value_name='bigram_ct')
.query('bigram_ct > 0')
.sort_values(['index','number']))
index number variable bigram_ct
3 0 123 farmer plants 1
15 0 123 plants grain 1
7 1 234 farmer son 1
13 1 234 go fishing 1
19 1 234 son go 1
2 2 345 catches tuna 1
11 2 345 fisher catches 1
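melt() names the former column labels "variable" by default. To get the column names used in the question's desired output, var_name and value_name can be passed directly. The following is a minimal sketch on a made-up two-row frame (the toy data is an assumption for illustration, not the full example above):

```python
import pandas as pd

# Toy wide frame: one row per document, one column per bigram
wide = pd.DataFrame(
    {
        "number": [123, 234],
        "farmer plants": [1, 0],
        "farmer son": [0, 1],
    }
)

long_df = (
    wide.melt(id_vars="number", var_name="bigram", value_name="frequency")
    .query("frequency > 0")  # keep only bigrams that occur in the document
    .reset_index(drop=True)
)
print(long_df)
```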
Now repeat the process for tfidf:
tfidf = pd.DataFrame(tfidf_mat.todense(), index=data.index, columns=cv.get_feature_names())
tfidf['number'] = data.number
tfidf_long = pd.melt(tfidf.reset_index(),
id_vars=['index','number'],
value_name='tfidf').query('tfidf > 0')
Finally, merge bigrams and tfidf:
fulldf = (bigrams_long.merge(tfidf_long,
on=['index','number','variable'])
.set_index('index'))
number variable bigram_ct tfidf
index
0 123 farmer plants 1 0.707107
0 123 plants grain 1 0.707107
1 234 farmer son 1 0.577350
1 234 go fishing 1 0.577350
1 234 son go 1 0.577350
2 345 catches tuna 1 0.707107
2 345 fisher catches 1 0.707107
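Since the question asks about avoiding the dense conversion, here is one possible alternative, offered as a sketch rather than as part of the approach above: a SciPy sparse matrix can be converted to COO format with tocoo(), which exposes the row index, column index, and value of each stored entry. Those three arrays map directly onto the long format, so no zero cells are ever materialized. The helper name to_long, the doc_row key column, and the output column names are assumptions chosen to match the question's desired output.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

data = pd.DataFrame(
    {
        "number": [123, 234, 345],
        "cleanText": [
            "farmer plants grain",
            "farmer son go fishing",
            "fisher catches tuna",
        ],
    }
)

cv = CountVectorizer(token_pattern=r"(?u)\b\w+\b", ngram_range=(2, 2))
dt_mat = cv.fit_transform(data["cleanText"])
tfidf_mat = TfidfTransformer().fit_transform(dt_mat)

# get_feature_names_out() exists in scikit-learn >= 1.0; fall back otherwise
if hasattr(cv, "get_feature_names_out"):
    names = np.asarray(cv.get_feature_names_out())
else:
    names = np.asarray(cv.get_feature_names())
numbers = data["number"].to_numpy()


def to_long(mat, value_name):
    """Build one row per stored (document, bigram) entry of a sparse matrix."""
    coo = mat.tocoo()
    return pd.DataFrame(
        {
            "doc_row": coo.row,  # row position, used as a safe merge key
            "number": numbers[coo.row],
            "bigram": names[coo.col],
            value_name: coo.data,
        }
    )


sparse_long = (
    to_long(dt_mat, "frequency")
    .merge(to_long(tfidf_mat, "tfidf_score"), on=["doc_row", "number", "bigram"])
    .sort_values(["doc_row", "bigram"])
    .reset_index(drop=True)
)
print(sparse_long)
```

Merging on doc_row as well as number keeps the join correct even if the number column contains duplicates. On the example data this yields the same seven rows as the melt-and-merge result.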