Posts by for*_*joe

Selecting specific rows from an rpy2 DataFrame

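My data frame is survey data that I obtained from a .csv file. One of the columns is age, and I want to remove all respondents under 18. I then need to separate the age groups (18-24, 25-35, etc.) into their own data frames so that I can produce a frequency distribution for each. In R this seems straightforward: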

x.sub <- subset(x.df, y > 2)


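But I can't figure out how to use the r() function to get my data frame variable from Python into an R statement. It feels as though there should be a .subset() method on the rpy2 DataFrame class, but if one exists, I can't find it.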
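For reference, the R side of the task described above can be sketched as follows. This is a minimal base-R sketch, not an rpy2 recipe: the data frame contents, the `age` column name, and the age breaks above 35 are assumptions made for illustration.

```r
# Hypothetical survey data; the column name `age` is an assumption.
x.df <- data.frame(id = 1:8, age = c(15, 17, 18, 22, 25, 31, 36, 52))

# Drop respondents under 18.
adults <- subset(x.df, age >= 18)

# Bin ages into groups. Intervals are closed on the left, so
# [18, 25) is "18-24" and [25, 36) is "25-35"; "36+" is illustrative.
adults$age_group <- cut(adults$age,
                        breaks = c(18, 25, 36, Inf),
                        right = FALSE,
                        labels = c("18-24", "25-35", "36+"))

# One data frame per age group, plus a frequency table of group sizes.
by_group <- split(adults, adults$age_group)
freq <- table(adults$age_group)
```

Here `by_group[["18-24"]]` holds the rows for that group, and `freq` gives the count of respondents per group.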

rpy2

for*_*joe

2014 11-13

7
votes

1
answer

1894
views

Document-term matrix for a Naive Bayes classifier: unexpected results in R

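I'm having some very annoying problems getting a Naive Bayes classifier to work with a document-term matrix. I'm sure I'm making a very simple mistake, but I can't figure out what it is. My data comes from accounts spreadsheets. I've been asked to work out which categories (in text format: mostly names of departments or budgets) are more likely to spend money on charities and which mostly (or only) spend on private companies. It was suggested that I use a Naive Bayes classifier for this. I have a large amount of data to train a model and several hundred thousand rows to test it. I've prepared the strings by replacing spaces with underscores and '&'/'and' with '+', and then treating each category as one term, so 'alcohol and drug addiction' becomes alcohol+drug_addiction.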
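The string preparation described above could look roughly like this. It is a sketch under assumptions: the function name `normalise_category` is hypothetical, and lowercasing is assumed only because the example rows are lowercase.

```r
# Turn one category label into a single term:
# '&' or the word 'and' becomes '+', remaining spaces become '_'.
normalise_category <- function(x) {
  x <- tolower(trimws(x))
  # \b keeps words such as "standards" or "land" intact.
  x <- gsub("\\s*(&|\\band\\b)\\s*", "+", x, perl = TRUE)
  gsub("\\s+", "_", x)
}

normalise_category("Alcohol and Drug Addiction")
# "alcohol+drug_addiction"

# A row is then the space-separated list of its category terms.
row_text <- paste(normalise_category(c("Environment & Housing",
                                       "Strategy and Commissioning")),
                  collapse = " ")
```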

Some example rows:

"environment+housing strategy+commissioning third_party_payments supporting_ppl_block_gross_chargeable" -> This row went to a charity
"west_north_west customer+tenancy premises h.r.a._special_maintenance" -> This row went to a private company.


Using this example as a template, I wrote the following function to build my document-term matrices (using tm) for the training and test data.

library(tm)     # corpus handling and term matrices
library(e1071)  # naiveBayes()

# Build a document-term matrix (documents in rows) from a character vector.
getMatrix <- function(chrVect){
    testsource <- VectorSource(chrVect)
    testcorpus <- Corpus(testsource)
    testcorpus <- tm_map(testcorpus, stripWhitespace)
    testcorpus <- tm_map(testcorpus, removeWords, stopwords("english"))
    # Transposing the term-document matrix puts documents in rows.
    t(TermDocumentMatrix(testcorpus))
}

trainmatrix <- getMatrix(traindata$cats)
testmatrix <- getMatrix(testdata$cats)
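One thing worth checking with this setup: because getMatrix is called separately on the training and test data, the two matrices are built from different vocabularies and may not share the same columns. The following is a base-R stand-in, not the tm API, that shows the idea of reusing the training vocabulary for the test data. The function name `build_dtm` and the sample strings are assumptions for illustration, and this is an observation rather than a confirmed diagnosis.

```r
# Base-R stand-in for a document-term matrix (documents in rows).
# If `vocab` is given, only those terms are counted, in that order.
build_dtm <- function(docs, vocab = NULL) {
  tokens <- strsplit(docs, "\\s+")
  if (is.null(vocab)) vocab <- sort(unique(unlist(tokens)))
  # Count each vocabulary term per document; unseen terms are dropped.
  counts <- vapply(tokens,
                   function(tk) as.integer(table(factor(tk, levels = vocab))),
                   integer(length(vocab)))
  matrix(counts, nrow = length(docs), ncol = length(vocab),
         byrow = TRUE, dimnames = list(NULL, vocab))
}

train_docs <- c("environment+housing strategy+commissioning third_party_payments",
                "west_north_west customer+tenancy premises")
test_docs  <- c("customer+tenancy premises new_unseen_term")

train_dtm <- build_dtm(train_docs)
# Reuse the training vocabulary so the columns line up.
test_dtm  <- build_dtm(test_docs, vocab = colnames(train_dtm))
```

With the shared vocabulary, `test_dtm` has exactly the columns the model was trained on, and the term that never appeared in training is ignored.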


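So far so good. The problem comes when I try to a) fit the Naive Bayes model and b) predict from that model. Using the klaR package, I get a zero-probability error, because many of the terms have zero instances in one of the categories, and setting a Laplace term doesn't seem to fix this. Using e1071, the model fits, but when I test it using: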

model <- naiveBayes(as.matrix(trainmatrix),as.factor(traindata$Code))
rs<- predict(model, as.matrix(testdata$cats))


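...every item is predicted as the same category, even though the two categories should be roughly equal. Something in the model clearly isn't working. Looking at some of the terms in model$tables, I can see that many have values for private and zero for charity, and vice versa. I have used as.factor for the code.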

output:
rs   1 …
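To make the zero-probability problem mentioned above concrete, here is a small base-R sketch of Naive Bayes on term presence with additive (Laplace) smoothing. It is not the klaR or e1071 implementation: `fit_nb`, `predict_nb`, and the toy data are assumptions for illustration. It also predicts from a matrix with the same term columns as the training data, whereas the predict() call above is given the raw strings in testdata$cats; whether that is the cause of the identical predictions is not confirmed here.

```r
# Toy training data: rows are documents, columns are terms (1 = present).
train_x <- rbind(c(housing = 1, strategy = 1, premises = 0),
                 c(housing = 1, strategy = 0, premises = 0),
                 c(housing = 0, strategy = 0, premises = 1),
                 c(housing = 0, strategy = 1, premises = 1))
train_y <- factor(c("charity", "charity", "private", "private"))

# Fit: P(term present | class) with additive smoothing `alpha`.
fit_nb <- function(x, y, alpha = 1) {
  classes <- levels(y)
  cond <- sapply(classes, function(cl) {
    rows <- x[y == cl, , drop = FALSE]
    # Two outcomes per term (present/absent), hence 2 * alpha.
    (colSums(rows) + alpha) / (nrow(rows) + 2 * alpha)
  })
  prior <- as.vector(table(y)) / length(y)
  names(prior) <- classes
  list(prior = prior, cond = cond)
}

# Predict: pick the class with the highest log posterior score.
predict_nb <- function(model, newx) {
  newx <- newx[, rownames(model$cond), drop = FALSE]  # same terms, same order
  scores <- sapply(names(model$prior), function(cl) {
    p <- model$cond[, cl]
    as.vector(log(model$prior[[cl]]) +
                newx %*% log(p) + (1 - newx) %*% log(1 - p))
  })
  scores <- matrix(scores, nrow = nrow(newx),
                   dimnames = list(NULL, names(model$prior)))
  colnames(scores)[max.col(scores, ties.method = "first")]
}

unsmoothed <- fit_nb(train_x, train_y, alpha = 0)  # contains exact zeros
smoothed   <- fit_nb(train_x, train_y, alpha = 1)  # no zeros

test_x <- rbind(c(housing = 1, strategy = 1, premises = 0),
                c(housing = 0, strategy = 0, premises = 1))
pred <- predict_nb(smoothed, test_x)
```

In `unsmoothed$cond`, any term seen in only one class has probability 0 for the other, which wipes out that class's score for every document containing the term. With `alpha = 1` every conditional probability stays strictly between 0 and 1.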


r bayesian tm

for*_*joe

lucky-day

6
votes

1
answer

2054
views
