Bon*_*son 10 python pandas scikit-learn sklearn-pandas
I have additional derived values among my X variables that I want to use in the model.
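import pandas as pd
from sklearn.model_selection import train_test_split

# pd_data is the DataFrame holding the text column and the derived numeric columns
XAll = pd_data[['title', 'wordcount', 'sumscores', 'length']]
y = pd_data['sentiment']
X_train, X_test, y_train, y_test = train_test_split(XAll, y, random_state=1)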
Because the title column contains text, I first convert it separately into a document-term matrix (DTM):
from sklearn.feature_extraction.text import CountVectorizer

vect = CountVectorizer(max_df=0.5)
vect.fit(X_train['title'])
X_train_dtm = vect.transform(X_train['title'])
column_index = X_train_dtm.indices
print(type(X_train_dtm)) # This is <class 'scipy.sparse.csr.csr_matrix'>
print("X_train_dtm shape",X_train_dtm.get_shape()) # This is (856, 2016)
print("column index:",column_index) # This is column index: [ 533 754 859 ..., 633 950 1339]
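Now that I have the text as a document-term matrix, I want to add the other numeric features, such as 'wordcount', 'sumscores', and 'length', to X_train_dtm. I will build the model on the new DTM, so I expect the additional features to make it more accurate.
How do I add additional numeric columns from a pandas DataFrame to a sparse CSR matrix?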
Bon*_*son 13
Found the solution. We can use sparse.hstack to do this:
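import numpy as np
from scipy.sparse import hstack

# [:, None] reshapes the 1-D column into an (n_rows, 1) array so it can be stacked as one extra column
X_train_dtm = hstack((X_train_dtm, np.array(X_train['wordcount'])[:, None]))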
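To append all three numeric columns at once rather than one at a time, the same hstack idea can be sketched as below. This is only an illustration under assumptions: the small DataFrame is made-up toy data standing in for X_train, and the names numeric_cols, numeric_block, and X_train_combined are hypothetical. Note that hstack returns a COO matrix by default, so format="csr" is passed here to keep the result in CSR form.

```python
import pandas as pd
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import CountVectorizer

# Toy stand-in for X_train (made-up data, for illustration only)
X_train = pd.DataFrame({
    "title": ["good movie", "bad movie", "great plot twist"],
    "wordcount": [2, 2, 3],
    "sumscores": [0.8, -0.6, 0.9],
    "length": [10, 9, 16],
})

# Build the document-term matrix from the text column
vect = CountVectorizer()
X_train_dtm = vect.fit_transform(X_train["title"])

# Convert the numeric columns to a sparse block with one row per document
numeric_cols = ["wordcount", "sumscores", "length"]  # hypothetical helper name
numeric_block = csr_matrix(X_train[numeric_cols].to_numpy(dtype=float))

# Stack column-wise; format="csr" avoids the default COO result
X_train_combined = hstack([X_train_dtm, numeric_block], format="csr")
print(X_train_combined.shape)
```

The test set would presumably need the same treatment: call vect.transform (not fit) on X_test['title'] and stack the same numeric columns in the same order, so that both matrices end up with identical column layouts.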