R text mining - how to change text in an R data frame column into several columns with word frequencies?

rda*_*tor 2 r text-mining tm

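I have a data frame with 4 columns. Column 1 consists of IDs, column 2 consists of texts (about 100 words each), and columns 3 and 4 contain labels.

Now I want to retrieve word frequencies (of the most common words) from the text column and add those frequencies as extra columns to the data frame. I would like the column names to be the words themselves and the columns to be filled with each word's frequency (from 0 to ... per text).

I tried some functions of the tm package, but so far without satisfactory results. Does anyone know how to handle this problem or where to start? Is there a package that can do the job?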

id  texts   label1    label2

Tyl*_*ker 7

So let's work through the problem...

I'm guessing you have a data.frame that looks like this:

       person sex adult                                 state code
1         sam   m     0         Computer is fun. Not too fun.   K1
2        greg   m     0               No it's not, it's dumb.   K2
3     teacher   m     1                    What should we do?   K3
4         sam   m     0                  You liar, it stinks!   K4
5        greg   m     0               I am telling the truth!   K5
6       sally   f     0                How can we be certain?   K6
7        greg   m     0                      There is no way.   K7
8         sam   m     0                       I distrust you.   K8
9       sally   f     0           What are you talking about?   K9
10 researcher   f     1         Shall we move on?  Good then.  K10
11       greg   m     0 I'm hungry.  Let's eat.  You already?  K11

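This data set comes from the qdap package. To get qdap, use install.packages("qdap").

Now, to make a reproducible example of the kind I'm talking about from your own data set, do what I do here with the DATA data set from qdap: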

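library(qdap)      # provides the DATA example set and wfm()
DATA
dput(head(DATA))   # prints R code that recreates the first rows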
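
If it helps to see why dput output makes an example reproducible, here is a minimal sketch in base R. The data frame `small` and its columns are hypothetical stand-ins for your own data, not something from qdap. It writes the dput text to a temporary file and reads it back with dget to show that the printed code rebuilds the same object:

```r
# Hypothetical two-row stand-in for your own data frame
small <- data.frame(
  id    = 1:2,
  texts = c("Computer is fun.", "No it's not."),
  stringsAsFactors = FALSE
)

# dput() writes R code that recreates the object; here it goes to a file
tmp <- tempfile(fileext = ".R")
dput(small, file = tmp)

# dget() evaluates that code again, giving back an equivalent data frame
rebuilt <- dget(tmp)
all.equal(small, rebuilt)
```

Pasting the text printed by dput(head(yourdata)) into a question lets others recreate the data with a single assignment.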

OK, now for your original question. I believe wfm will do what you want:

freqs <- t(wfm(DATA$state, 1:nrow(DATA)))
data.frame(DATA, freqs, check.names = FALSE)
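
For readers who cannot install qdap, the same idea can be sketched in base R. This is not how wfm works internally; it is a rough approximation under my own assumptions: the data frame `df` is a hypothetical stand-in for the asker's four columns, and the tokenizer simply lowercases the text and keeps only letters and apostrophes.

```r
# Hypothetical stand-in for the asker's data frame (id, texts, label1, label2)
df <- data.frame(
  id     = 1:3,
  texts  = c("Computer is fun. Not too fun.",
             "No it's not, it's dumb.",
             "What should we do?"),
  label1 = c("a", "b", "a"),
  label2 = c("x", "x", "y"),
  stringsAsFactors = FALSE
)

# Lowercase, replace everything except letters, apostrophes and spaces, split
clean  <- gsub("[^a-z' ]", " ", tolower(df$texts))
tokens <- strsplit(trimws(clean), " +")

# Vocabulary: every distinct word across all texts
vocab <- sort(unique(unlist(tokens)))

# Count each vocabulary word per text; rows are texts, columns are words
freqs <- do.call(rbind, lapply(tokens, function(w) {
  as.integer(table(factor(w, levels = vocab)))
}))
colnames(freqs) <- vocab

# check.names = FALSE keeps column names such as "it's" unchanged
result <- data.frame(df, freqs, check.names = FALSE)
```

Using factor(w, levels = vocab) is what gives every text a count for every word, including the zeros, so the rows line up when bound together.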

If you only want the top so many words, use the sorting technique I use here:

freqs <- t(wfm(DATA$state, 1:nrow(DATA)))
ords <- rev(sort(colSums(freqs)))[1:9]      #top 9 words
top9 <- freqs[, names(ords)]                #grab those columns from freqs  
data.frame(DATA, top9, check.names = FALSE) #put it together
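
One thing to watch with the snippet above: indexing with [1:9] yields NA entries when there are fewer than nine distinct words. A small variation, sketched here on a hypothetical count matrix (the values and the choice of `n` are made up for illustration), uses head() so it never asks for more columns than exist:

```r
# Hypothetical count matrix: rows are texts, columns are words
# (matrix() fills column by column, so each line below is one word)
freqs <- matrix(
  c(2, 0, 1,   # fun
    0, 1, 1,   # it's
    0, 4, 0,   # we
    1, 0, 0),  # you
  nrow = 3,
  dimnames = list(NULL, c("fun", "it's", "we", "you"))
)

n <- 2  # how many of the most frequent words to keep

# Total count per word, largest first
totals <- sort(colSums(freqs), decreasing = TRUE)

# head() returns at most n names, even if fewer than n words exist
top_words <- names(head(totals, n))

# drop = FALSE keeps a matrix even when only one column is selected
top <- freqs[, top_words, drop = FALSE]
```

The resulting `top` matrix can be bound onto the original data frame with data.frame(..., check.names = FALSE) exactly as before.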

The result looks like this:

> data.frame(DATA, top9, check.names = FALSE)
       person sex adult                                 state code you we what not no it's is i fun
1         sam   m     0         Computer is fun. Not too fun.   K1   0  0    0   1  0    0  1 0   2
2        greg   m     0               No it's not, it's dumb.   K2   0  0    0   1  1    2  0 0   0
3     teacher   m     1                    What should we do?   K3   0  1    1   0  0    0  0 0   0
4         sam   m     0                  You liar, it stinks!   K4   1  0    0   0  0    0  0 0   0
5        greg   m     0               I am telling the truth!   K5   0  0    0   0  0    0  0 1   0
6       sally   f     0                How can we be certain?   K6   0  1    0   0  0    0  0 0   0
7        greg   m     0                      There is no way.   K7   0  0    0   0  1    0  1 0   0
8         sam   m     0                       I distrust you.   K8   1  0    0   0  0    0  0 1   0
9       sally   f     0           What are you talking about?   K9   1  0    1   0  0    0  0 0   0
10 researcher   f     1         Shall we move on?  Good then.  K10   0  1    0   0  0    0  0 0   0
11       greg   m     0 I'm hungry.  Let's eat.  You already?  K11   1  0    0   0  0    0  0 0   0