Splitting speakers and dialogue in RStudio

ero*_*oar 6 r text-mining

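I have the following document:

President Dr. Norbert Lammert: I declare the session open.

I will now give the floor to Bundesminister Alexander Dobrindt.

(Applause of CDU/CSU and delegates of the SPD)

Alexander Dobrindt, Minister for Transport and Digital Infrastructure:

Ladies and Gentleman. We will today start the biggest investment in infrastructure that ever existed, with over 270 billion Euro, over 1 000 projects and a clear financing perspective.

(Volker Kauder [CDU/CSU]: Genau!)

(Applause of the CDU/CSU and the SPD)

When I read in those .txt documents, I would like to create a second column that indicates the speaker's name.

So what I tried was to first create a list of all possible names and replace them.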

library(qdap)

members <- c("Alexander Dobrindt, Minister for Transport and Digital Infrastructure:","President Dr. Norbert Lammert:")
members_r <- c("@Alexander Dobrindt, Minister for Transport and Digital Infrastructure:","@President Dr. Norbert Lammert:")

prok <- scan(".txt", what = "character", sep = "\n")
prok <- mgsub(members,members_r,prok)

prok <- as.data.frame(prok)
prok$speaker <- grepl("@[^\\@:]*:",prok$prok, ignore.case = T)

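My plan was to extract the name between @ and : with a regex wherever speaker == TRUE, and carry it downward until a different name appears (and, obviously, delete all the applause/interjection brackets), but that is also the part I don't know how to do.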

Mar*_*son 1

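Here is an approach that leans heavily on dplyr.

First, I added a sentence to the sample text to illustrate why we can't identify speaker names using the colon alone.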

library(dplyr)  # provides %>%, bind_rows(), mutate(), group_by(), summarise()

sampleText <-
"President Dr. Norbert Lammert: I declare the session open.

I will now give the floor to Bundesminister Alexander Dobrindt.

(Applause of CDU/CSU and delegates of the SPD)

Alexander Dobrindt, Minister for Transport and Digital Infrastructure:

Ladies and Gentleman. We will today start the biggest investment in infrastructure that ever existed, with over 270 billion Euro, over 1 000 projects and a clear financing perspective.

(Volker Kauder [CDU/CSU]: Genau!)

(Applause of the CDU/CSU and the SPD)

This sentence right here: it is an example of a problem"

Then I split the text to simulate the format you appear to be reading in (this also puts each speech into one element of a list).

splitText <- strsplit(sampleText, "\n")
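If the transcript lives in a file rather than in a string, the same list-of-lines shape can probably be obtained with readLines. This is only a minimal sketch: a temporary file stands in for the real transcript, and the names demoText, tmpFile, fileLines and splitFromFile are made up for illustration.

```r
# Hypothetical stand-in for a transcript file on disk
demoText <- "President Dr. Norbert Lammert: I declare the session open.\n\n(Applause)"
tmpFile <- tempfile(fileext = ".txt")
writeLines(demoText, tmpFile)

# readLines returns one element per line, blank lines included
fileLines <- readLines(tmpFile)

# Wrap in a list so it has the same shape as strsplit(demoText, "\n")
splitFromFile <- list(fileLines)

# Compare with the strsplit version used in this answer
identical(splitFromFile, strsplit(demoText, "\n"))
```

With several files, one list element per file would let the later lapply calls run over all of them at once.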

Then I pull out all the potential speakers (anything before a colon):

allSpeakers <- lapply(splitText, function(thisText){
  grep(":", thisText, value = TRUE) %>%
    gsub(":.*", "", .) %>%
    gsub("\\(", "", .)
}) %>%
  unlist() %>%
  unique()

This gives us:

[1] "President Dr. Norbert Lammert"                                        
[2] "Alexander Dobrindt, Minister for Transport and Digital Infrastructure"
[3] "Volker Kauder [CDU/CSU]"                                              
[4] "This sentence right here" 

Obviously, the last one is not a legitimate name, so it should be excluded from our list of speakers:

legitSpeakers <-
  allSpeakers[-4]
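Dropping by position breaks as soon as the order or number of candidates changes. A small sketch of excluding by value instead, assuming the false positives are known in advance. The names candidateSpeakers, notSpeakers and legitByValue are hypothetical, and the candidate vector is retyped here only so the block runs on its own.

```r
# Candidates as printed above, retyped so this block is self-contained
candidateSpeakers <- c(
  "President Dr. Norbert Lammert",
  "Alexander Dobrindt, Minister for Transport and Digital Infrastructure",
  "Volker Kauder [CDU/CSU]",
  "This sentence right here"
)

# Hypothetical list of known false positives
notSpeakers <- c("This sentence right here")

# Keep every candidate that is not a known false positive
legitByValue <- setdiff(candidateSpeakers, notSpeakers)
```

On a real transcript you would still have to inspect the candidates by eye once to decide what goes into notSpeakers.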

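Now we are ready to work through the speech. I have included step-by-step comments below, rather than describing the steps in text here.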

speechText <- lapply(splitText, function(thisText){

  # Remove applause and interjections (things in parentheses)
  # along with any blank lines; though you could leave blanks if you want
  cleanText <-
    grep("(^\\(.*\\)$)|(^$)", thisText
         , value = TRUE, invert = TRUE)

  # Split each line at every colon
  strsplit(cleanText, ":") %>%
    lapply(function(x){
      # Check if the first element is a legit speaker
      if(x[1] %in% legitSpeakers){
        # If so, set the speaker, and put the statement in a separate portion
        # taking care to re-collapse any breaks caused by additional colons
        out <- data.frame(speaker = x[1]
                          , text = paste(x[-1], collapse = ":"))
      } else{
        # If not a legit speaker, set speaker to NA and reset text as above
        out <- data.frame(speaker = NA
                          , text = paste(x, collapse = ":"))
      }
      # Return whichever version we made above
      return(out)
    }) %>%
    # Bind all of the rows together
    bind_rows %>%
    # Identify clusters of speech that go with a single speaker
    mutate(speakingGroup = cumsum(!is.na(speaker))) %>%
    # Group by those clusters
    group_by(speakingGroup) %>%
    # Collapse that speaking down into a single row
    summarise(speaker = speaker[1]
              , fullText = paste(text, collapse = "\n"))
})

This yields:

[[1]]

speakingGroup  speaker                                                                fullText                                                                                                                                                                                                                                        

            1  President Dr. Norbert Lammert                                          I declare the session open.\nI will now give the floor to Bundesminister Alexander Dobrindt.                                                                                                                                                     
            2  Alexander Dobrindt, Minister for Transport and Digital Infrastructure  Ladies and Gentleman. We will today start the biggest investment in infrastructure that ever existed, with over 270 billion Euro, over 1 000 projects and a clear financing perspective.\nThis sentence right here: it is an example of a problem
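The parenthesis filter at the top of that function can be tried on its own to see exactly which lines it drops. A minimal sketch with a made-up vector (demoLines and keptLines are hypothetical names, and the last line is invented to show that a parenthetical inside a sentence is kept):

```r
demoLines <- c(
  "President Dr. Norbert Lammert: I declare the session open.",
  "",
  "(Applause of CDU/CSU and delegates of the SPD)",
  "(Volker Kauder [CDU/CSU]: Genau!)",
  "This sentence (with an aside) stays."
)

# invert = TRUE keeps the lines that do NOT match the pattern, so
# fully parenthesised lines and empty lines are dropped
keptLines <- grep("(^\\(.*\\)$)|(^$)", demoLines, value = TRUE, invert = TRUE)
```

Only lines that are wrapped in parentheses from start to end are removed, which is why the interjection by Volker Kauder never reaches the colon-splitting step.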

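If you would rather have each line of text on its own row, replace the summarise at the end with mutate(speaker = speaker[1]), and you will get one row for each line of the speech, like this: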

speaker                                                                text                                                                                                                                                                                      speakingGroup
President Dr. Norbert Lammert                                          I declare the session open.                                                                                                                                                                           1
President Dr. Norbert Lammert                                          I will now give the floor to Bundesminister Alexander Dobrindt.                                                                                                                                       1
Alexander Dobrindt, Minister for Transport and Digital Infrastructure                                                                                                                                                                                                        2
Alexander Dobrindt, Minister for Transport and Digital Infrastructure  Ladies and Gentleman. We will today start the biggest investment in infrastructure that ever existed, with over 270 billion Euro, over 1 000 projects and a clear financing perspective.              2
Alexander Dobrindt, Minister for Transport and Digital Infrastructure  This sentence right here: it is an example of a problem                                                                                                                                               2
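The per-line variant can be sketched in isolation as follows. This assumes the lines have already been parsed into a data frame with NA for continuation lines. The data frame parsed and the result perLine are hypothetical names, and the rows are a shortened copy of the sample above.

```r
library(dplyr)

# Hypothetical, already-parsed lines: speaker is NA on continuation lines
parsed <- data.frame(
  speaker = c("President Dr. Norbert Lammert", NA,
              "Alexander Dobrindt, Minister for Transport and Digital Infrastructure", NA),
  text = c("I declare the session open.",
           "I will now give the floor to Bundesminister Alexander Dobrindt.",
           "",
           "Ladies and Gentleman."),
  stringsAsFactors = FALSE
)

perLine <- parsed %>%
  # A new group starts at every row that names a speaker
  mutate(speakingGroup = cumsum(!is.na(speaker))) %>%
  group_by(speakingGroup) %>%
  # Copy the group's first (non-NA) speaker onto every row of the group
  mutate(speaker = speaker[1]) %>%
  ungroup()
```

This is essentially the "apply it downward until a different name appears" step from the question. The row with empty text comes from a speaker line that had nothing after the colon and can be filtered out afterwards if it is not wanted.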