Split a corpus into sentences in R

Hen*_*enk 12 (tags: r, split, sentence, tm, qdap)

  1. I have a number of PDF documents that I have read into a corpus with the tm library. How can I break the corpus up into sentences?

  2. It can be done by reading the files with readLines and then running them through sentSplit from the qdap package [*]. That function requires a data frame, and it would also mean abandoning the corpus and reading all the files individually (a rough sketch of that route appears after the note below).

  3. How can I apply the function sentSplit {qdap} to a corpus in tm? Or is there a better way?

Note: the openNLP library used to have a function sentDetect, which is now Maxent_Sent_Token_Annotator. The same question applies: how do you combine it with a corpus [tm]?
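As a rough illustration of point 2 (this is not from the original question), one way to hand the text of a tm corpus straight to sentSplit, without re-reading the files, might look like the sketch below. The helper name corpus_to_sentences_qdap and the doc/text column names are just illustrative choices, and it assumes a tm version where as.character() returns a document's text.

library(tm)
library(qdap)

corpus_to_sentences_qdap <- function(current.corpus) {
  # collapse each document to a single character string
  txt <- sapply(current.corpus, function(d) paste(as.character(d), collapse = " "))

  # sentSplit expects a data frame with a text column
  df <- data.frame(doc = seq_along(txt), text = unname(txt),
                   stringsAsFactors = FALSE)

  # one row per sentence, with the originating doc number carried along
  sentSplit(df, "text")
}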

Ton*_*yal 15

I don't know of a way to reshape a corpus directly, but that would be a fantastic feature to have.

I guess my approach would be something like this:

Use these packages:

# Load Packages
require(tm)
require(NLP)
require(openNLP)

I would set up my convert-text-to-sentences function as follows:

convert_text_to_sentences <- function(text, lang = "en") {
  # Function to compute sentence annotations using the Apache OpenNLP Maxent sentence detector employing the default model for language 'en'. 
  sentence_token_annotator <- Maxent_Sent_Token_Annotator(language = lang)

  # Convert text to class String from package NLP
  text <- as.String(text)

  # Sentence boundaries in text
  sentence.boundaries <- annotate(text, sentence_token_annotator)

  # Extract sentences
  sentences <- text[sentence.boundaries]

  # return sentences
  return(sentences)
}
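For example (assuming the English sentence model shipped via openNLPdata is installed), calling it on a short piece of text should return one character string per detected sentence:

convert_text_to_sentences("This is the first sentence. And this is a second one.")
# expect a character vector with one element per detected sentence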

My hack of a reshape_corpus function (note: you will lose the meta attributes unless you modify this function somehow and copy them across appropriately; see the sketch after the example run below):

reshape_corpus <- function(current.corpus, FUN, ...) {
  # Extract the text from each document in the corpus and put into a list
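  # (Content() is the accessor from the tm 0.5.x API used in this answer; with
  #  tm >= 0.6 the equivalent call would be lapply(current.corpus, content))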
  text <- lapply(current.corpus, Content)

  # Basically convert the text
  docs <- lapply(text, FUN, ...)
  docs <- as.vector(unlist(docs))

  # Create a new corpus structure and return it
  new.corpus <- Corpus(VectorSource(docs))
  return(new.corpus)
}

This works as follows:

## create a corpus
dat <- data.frame(doc1 = "Doctor Who is a British science fiction television programme produced by the BBC. The programme depicts the adventures of a Time Lord—a time travelling, humanoid alien known as the Doctor. He explores the universe in his TARDIS (acronym: Time and Relative Dimension in Space), a sentient time-travelling space ship. Its exterior appears as a blue British police box, a common sight in Britain in 1963, when the series first aired. Along with a succession of companions, the Doctor faces a variety of foes while working to save civilisations, help ordinary people, and right wrongs.",
                  doc2 = "The show has received recognition from critics and the public as one of the finest British television programmes, winning the 2006 British Academy Television Award for Best Drama Series and five consecutive (2005–10) awards at the National Television Awards during Russell T Davies's tenure as Executive Producer.[3][4] In 2011, Matt Smith became the first Doctor to be nominated for a BAFTA Television Award for Best Actor. In 2013, the Peabody Awards honoured Doctor Who with an Institutional Peabody \"for evolving with technology and the times like nothing else in the known television universe.\"[5]",
                  doc3 = "The programme is listed in Guinness World Records as the longest-running science fiction television show in the world[6] and as the \"most successful\" science fiction series of all time—based on its over-all broadcast ratings, DVD and book sales, and iTunes traffic.[7] During its original run, it was recognised for its imaginative stories, creative low-budget special effects, and pioneering use of electronic music (originally produced by the BBC Radiophonic Workshop).",
                  stringsAsFactors = FALSE)

current.corpus <- Corpus(VectorSource(dat))
# A corpus with 3 text documents

## reshape the corpus into sentences (modify this function if you want to keep meta data)
reshape_corpus(current.corpus, convert_text_to_sentences)
# A corpus with 10 text documents
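As a rough sketch of the "keep the meta data" modification mentioned above (this is not from the original answer; reshape_corpus_keep_origin and the "origin" tag are illustrative names, and it assumes a recent tm where meta() can be set on individual documents):

reshape_corpus_keep_origin <- function(current.corpus, FUN, ...) {
  # document ids from the original corpus (fall back to positions if unnamed)
  ids <- names(current.corpus)
  if (is.null(ids)) ids <- as.character(seq_along(current.corpus))

  # split each document into sentences and remember how many each produced
  text <- lapply(current.corpus, as.character)
  sentences <- lapply(text, FUN, ...)
  origins <- rep(ids, times = sapply(sentences, length))

  # build the sentence-level corpus and tag each sentence with its source document
  new.corpus <- Corpus(VectorSource(unlist(sentences)))
  for (i in seq_along(new.corpus)) {
    meta(new.corpus[[i]], "origin") <- origins[i]
  }
  new.corpus
}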

My sessionInfo output:

> sessionInfo()
R version 3.0.1 (2013-05-16)
Platform: x86_64-w64-mingw32/x64 (64-bit)

locale:
  [1] LC_COLLATE=English_United Kingdom.1252  LC_CTYPE=English_United Kingdom.1252    LC_MONETARY=English_United Kingdom.1252 LC_NUMERIC=C                           
[5] LC_TIME=English_United Kingdom.1252    

attached base packages:
  [1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
  [1] NLP_0.1-0     openNLP_0.2-1 tm_0.5-9.1   

loaded via a namespace (and not attached):
  [1] openNLPdata_1.5.3-1 parallel_3.0.1      rJava_0.9-4         slam_0.1-29         tools_3.0.1  

  • @KimStacks I figured out the issue. It was because ggplot2 and openNLP both have an annotate method, and I loaded ggplot2 after openNLP, so the annotate object got masked by ggplot2. Try loading openNLP after ggplot2 and it will be fine. (2 upvotes)
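An alternative to reordering the library() calls is to call NLP's generic by its fully qualified name inside convert_text_to_sentences, so it cannot be masked:

  # sentence boundaries in text (NLP:: makes the call immune to masking by ggplot2::annotate)
  sentence.boundaries <- NLP::annotate(text, sentence_token_annotator)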

Tyl*_*ker 5

openNLP has had some major changes. The bad news is that it looks very different from the way it used to. The good news is that it's more flexible, and the functionality you enjoyed before is still there; you just have to find it.

This will get you what you're after:

?Maxent_Sent_Token_Annotator

Just work through the examples and you'll see the functionality you're looking for.
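For reference, a minimal sketch along the lines of that help page's example (paraphrased here, reusing text from the answer above rather than the help page's own sample text):

library(NLP)
library(openNLP)

s <- as.String("Doctor Who is a British science fiction television programme produced by the BBC. The programme depicts the adventures of a Time Lord.")
sent_token_annotator <- Maxent_Sent_Token_Annotator()
boundaries <- NLP::annotate(s, sent_token_annotator)
s[boundaries]  # one character string per detected sentence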