How to count lines of text in R?

Question

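I would like to use R to count the number of lines spoken by different speakers (the text is a transcript of parliamentary speeches). The basic text looks like this: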

MR. JOHN: This activity has been going on in Tororo and I took it up with the office of the DPC. He told me that he was not aware of it.
MS. SMITH: Yes, I am aware of that. 
MR. LEHMAN: Therefore, I am seeking your guidance, Madam Speaker, and requesting that you re-assign the duty.  
MR. JOHN: Thank you

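In the document, each speaker has an identifier that begins with MR/MS and is always in capitals. I would like to create a dataset that counts the number of lines spoken by each speaker each time they speak in the document, so that the text above would result in: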

MR. JOHN: 2
MS. SMITH: 1
MR. LEHMAN: 2
MR. JOHN: 1

Thanks for any pointers using R!

Answer 1

Aru*_*run 10

You can split the strings on the pattern : and then use table:

table(sapply(strsplit(x, ":"), "[[", 1))
#   MR. JOHN MR. LEHMAN  MS. SMITH 
#          2          1          1

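strsplit - splits each string at : and returns a list
sapply with [[ - selects the first element of each list component
table - computes the frequencies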
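The three steps can be sketched one at a time. This is a minimal example that assumes x holds the four sample lines from the question, one element per line; the intermediate names parts, speakers and counts are only illustrative.

```r
# Sample lines from the question, one element per line of text
x <- c(
  "MR. JOHN: This activity has been going on in Tororo and I took it up with the office of the DPC. He told me that he was not aware of it.",
  "MS. SMITH: Yes, I am aware of that.",
  "MR. LEHMAN: Therefore, I am seeking your guidance, Madam Speaker, and requesting that you re-assign the duty.",
  "MR. JOHN: Thank you"
)

parts    <- strsplit(x, ":")        # list with one character vector per line
speakers <- sapply(parts, "[[", 1)  # first piece of each line, i.e. the speaker label
counts   <- table(speakers)         # number of lines that start with each label
print(counts)
```

Note that table pools all lines with the same label, so both of MR. JOHN's turns are counted together.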

Edit: Following the OP's comment. You can save the transcript in a text file and use readLines to read the text into R.

tt <- readLines("./tmp.txt")

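Now we have to find a pattern that filters this text down to only the lines carrying the name of the person speaking. I can think of two approaches, based on what I saw in the transcript you linked.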

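Check for a : and then look behind the : to see whether the preceding character is any of A-Z or [:punct:] (that is, whether the character just before the : is a capital letter or a punctuation mark; this is because some of the labels have a ) before the :).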

You can use strsplit followed by sapply (as shown below).

Using strsplit:

# filter tt by pattern
tt.f <- tt[grepl("(?<=[A-Z[:punct:]]):", tt, perl = TRUE)]
# Now you should only have the required lines, use the command above:

out <- table(sapply(strsplit(tt.f, ":"), "[[", 1))

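There are other approaches (using gsub, for example) or alternative patterns, but this should give you an idea of the approach. If the pattern needs to be different, change it so that it captures all the required lines.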
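As a rough sketch of the gsub-style alternative, sub (which replaces only the first match) can strip everything from the first colon onward instead of splitting. The tt.f vector below is a hypothetical stand-in for the filtered lines, and its last line is invented to show that a second colon in the speech does no harm.

```r
# Hypothetical stand-in for the lines kept by the filter
tt.f <- c(
  "MR. JOHN: Thank you",
  "MS. SMITH: Yes, I am aware of that.",
  "MR. JOHN: One more thing: the DPC called."  # invented line with a second colon
)

# Delete from the first colon to the end of the line, leaving only the label
speakers <- sub(":.*$", "", tt.f)
out <- table(speakers)
print(out)
```

Because the regex match starts at the leftmost colon, only the text before the first : survives, which is the same piece that strsplit followed by "[[" would select.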

Of course, this assumes that there are no other lines like, for example, this one:

"Mr. Chariman, whatever (bla bla): It is not a problem"

because our pattern will return TRUE for the ): in it. If this happens in your text, you will have to find a better pattern.
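One possible stricter pattern, offered only as a sketch, anchors the match at the start of the line. It assumes every label starts a line with MR. or MS. followed by an uppercase name, as the question describes; the real transcript may use other label formats. The line wrapping in tt and the names strict, turn_id and lines_per_turn are hypothetical. The same flags can then be used to count lines per turn, on the further assumption that the first line is a label.

```r
# Hypothetical transcript lines; the wrapping of the long speeches is assumed
tt <- c(
  "MR. JOHN: This activity has been going on in Tororo and I took it up with",
  "the office of the DPC. He told me that he was not aware of it.",
  "MS. SMITH: Yes, I am aware of that.",
  "MR. LEHMAN: Therefore, I am seeking your guidance, Madam Speaker, and",
  "requesting that you re-assign the duty.",
  "MR. JOHN: Thank you"
)
tricky <- "Mr. Chariman, whatever (bla bla): It is not a problem"

loose  <- "(?<=[A-Z[:punct:]]):"        # pattern from the answer
strict <- "^M[RS]\\. [A-Z][A-Z .'-]*:"  # assumed label format, anchored at line start

loose_hit  <- grepl(loose,  tricky, perl = TRUE)  # matches because of "):"
strict_hit <- grepl(strict, tricky, perl = TRUE)  # the line does not start with an uppercase label

is_label <- grepl(strict, tt, perl = TRUE)  # TRUE where a new speaker turn starts
turn_id  <- cumsum(is_label)                # running turn number for every line
lines_per_turn <- data.frame(
  speaker = sub(":.*$", "", tt[is_label]),  # label of each turn, in order
  lines   = as.vector(table(turn_id))       # lines in each turn, continuation lines included
)
print(lines_per_turn)
```

Unlike table on the labels alone, this keeps each turn separate, so a speaker who talks twice appears twice.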

+`t`. There is no point in posting while you are awake. When do you sleep? (2 upvotes)

Archived: 12 years, 10 months ago
Views: 1996
Last activity: 12 years, 10 months ago