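I have a data frame with 4 columns. Column 1 holds IDs, column 2 holds texts (about 100 words each), and columns 3 and 4 hold labels.
Now I would like to retrieve word frequencies (of the most common words) from the text column and add those frequencies as extra columns to the data frame. I would like the column names to be the words themselves, and the columns to be filled with each word's frequency in each text (from 0 to ...).
I tried some functions of the tm package, but so far the results are unsatisfactory. Does anyone know how to approach this problem or where to start? Is there a package that can do the job?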
id texts label1 label2
Well, let's work through the problem...
I'm guessing you have a data.frame that looks like this:
       person sex adult                               state code
 1        sam   m     0       Computer is fun. Not too fun.   K1
 2       greg   m     0             No it's not, it's dumb.   K2
 3    teacher   m     1                  What should we do?   K3
 4        sam   m     0                You liar, it stinks!   K4
 5       greg   m     0             I am telling the truth!   K5
 6      sally   f     0              How can we be certain?   K6
 7       greg   m     0                    There is no way.   K7
 8        sam   m     0                     I distrust you.   K8
 9      sally   f     0         What are you talking about?   K9
10 researcher   f     1        Shall we move on? Good then.  K10
11       greg   m     0 I'm hungry. Let's eat. You already?  K11
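This data set comes from the qdap package. To get qdap, use install.packages("qdap").
Now, to make the kind of reproducible example I'm talking about with your own data set, do what I do here with the DATA data set from qdap: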
DATA
dput(head(DATA))
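To illustrate what dput() gives you, here is a minimal sketch with a made-up two-row data frame (the names df, id and texts are hypothetical, not taken from the question). dput() prints an R expression that rebuilds the object, so anyone can paste it into their own session:

```r
# Hypothetical toy data standing in for the real data frame.
df <- data.frame(id = 1:2,
                 texts = c("a short text", "another short text"),
                 stringsAsFactors = FALSE)

# Capture the text that dput() would print to the console.
dumped <- paste(capture.output(dput(df)), collapse = "\n")

# Evaluating that text recreates the data frame.
rebuilt <- eval(parse(text = dumped))
all.equal(df, rebuilt)
```

Posting the dput() output of head(your_data) in a question is usually enough for others to reproduce the problem.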
OK, now on to your original question. I think wfm will do what you want:
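library(qdap) # provides wfm() and the DATA example data set
freqs <- t(wfm(DATA$state, 1:nrow(DATA))) # transpose: one row per text, one column per word
data.frame(DATA, freqs, check.names = FALSE) # check.names = FALSE keeps names like "it's" intact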
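If you would rather not depend on qdap, the same idea can be sketched in base R. This is only a rough approximation under my own assumptions, not what wfm does internally: the data frame df and its columns (id, texts, label1, label2) are made up to mirror the question, and the tokenizer is a simple lowercase-and-split that keeps apostrophes.

```r
# Hypothetical stand-in for the asker's data frame.
df <- data.frame(
  id     = 1:3,
  texts  = c("Computer is fun. Not too fun.",
             "No it's not, it's dumb.",
             "What should we do?"),
  label1 = c("a", "b", "a"),
  label2 = c("x", "x", "y"),
  stringsAsFactors = FALSE
)

# Lowercase, replace everything except letters, apostrophes and spaces,
# then split on runs of spaces and drop empty tokens.
clean  <- gsub("[^a-z' ]", " ", tolower(df$texts))
tokens <- lapply(strsplit(clean, " +"), function(w) w[nzchar(w)])

# Shared vocabulary, then count each text's words against it.
vocab <- sort(unique(unlist(tokens)))
freqs <- t(vapply(tokens,
                  function(w) tabulate(match(w, vocab), nbins = length(vocab)),
                  integer(length(vocab))))
colnames(freqs) <- vocab

# Bind the counts to the original columns, keeping the words as column names.
result <- data.frame(df, freqs, check.names = FALSE)
```

Counting against a shared vocabulary is what guarantees that every text gets a value (possibly 0) for every word, which is the shape the question asks for.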
If you only want the top so many words, use the ordering technique I use here:
freqs <- t(wfm(DATA$state, 1:nrow(DATA)))
ords <- rev(sort(colSums(freqs)))[1:9] #top 9 words
top9 <- freqs[, names(ords)] #grab those columns from freqs
data.frame(DATA, top9, check.names = FALSE) #put it together
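One caveat with indexing by a fixed [1:9]: if there are fewer distinct words than that, the index produces NA entries. A slightly more defensive variant might look like the sketch below. The small freqs matrix and the n_top variable are hypothetical, made up only so the snippet runs on its own.

```r
# Hypothetical count matrix: rows are texts, columns are words.
freqs <- matrix(
  c(2, 0, 1,
    0, 3, 0,
    1, 1, 0),
  nrow = 3, byrow = TRUE,
  dimnames = list(NULL, c("fun", "not", "is"))
)

n_top  <- 2                # how many words to keep (assumed value)
totals <- colSums(freqs)   # overall frequency of each word

# Names of the most frequent words, never asking for more than exist.
keep <- names(sort(totals, decreasing = TRUE))[seq_len(min(n_top, length(totals)))]

# drop = FALSE keeps a matrix even when only one column is selected.
top <- freqs[, keep, drop = FALSE]
```

The resulting top matrix can then be passed to data.frame(..., check.names = FALSE) in the same way as top9.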
The result looks like this:
> data.frame(DATA, top9, check.names = FALSE)
       person sex adult                               state code you we what not no it's is i fun
 1        sam   m     0       Computer is fun. Not too fun.   K1   0  0    0   1  0    0  1 0   2
 2       greg   m     0             No it's not, it's dumb.   K2   0  0    0   1  1    2  0 0   0
 3    teacher   m     1                  What should we do?   K3   0  1    1   0  0    0  0 0   0
 4        sam   m     0                You liar, it stinks!   K4   1  0    0   0  0    0  0 0   0
 5       greg   m     0             I am telling the truth!   K5   0  0    0   0  0    0  0 1   0
 6      sally   f     0              How can we be certain?   K6   0  1    0   0  0    0  0 0   0
 7       greg   m     0                    There is no way.   K7   0  0    0   0  1    0  1 0   0
 8        sam   m     0                     I distrust you.   K8   1  0    0   0  0    0  0 1   0
 9      sally   f     0         What are you talking about?   K9   1  0    1   0  0    0  0 0   0
10 researcher   f     1        Shall we move on? Good then.  K10   0  1    0   0  0    0  0 0   0
11       greg   m     0 I'm hungry. Let's eat. You already?  K11   1  0    0   0  0    0  0 0   0