Ton*_*yal 15
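I don't know how to reshape a corpus, but that would be a great feature to have.
My approach would be something like this.
Using these packages: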
# Load Packages
require(tm)
require(NLP)
require(openNLP)
I set up my text-to-sentences function as follows:
convert_text_to_sentences <- function(text, lang = "en") {
  # Function to compute sentence annotations using the Apache OpenNLP Maxent sentence detector employing the default model for language 'en'.
  sentence_token_annotator <- Maxent_Sent_Token_Annotator(language = lang)
  # Convert text to class String from package NLP
  text <- as.String(text)
  # Sentence boundaries in text
  sentence.boundaries <- annotate(text, sentence_token_annotator)
  # Extract sentences
  sentences <- text[sentence.boundaries]
  # return sentences
  return(sentences)
}
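My hack of a reshape-corpus function (note: you will lose the meta attributes here unless you modify this function somehow and copy them over appropriately):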
reshape_corpus <- function(current.corpus, FUN, ...) {
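  # Extract the text from each document in the corpus and put into a list
  # (Content() is the accessor in tm 0.5-x; tm 0.6 and later use content())
  text <- lapply(current.corpus, Content)
  # Apply the conversion function (e.g. a sentence splitter) to each document
  docs <- lapply(text, FUN, ...)
  # Flatten the per-document results into a single vector
  docs <- as.vector(unlist(docs))
  # Create a new corpus structure and return it
  new.corpus <- Corpus(VectorSource(docs))
  return(new.corpus)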
}
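One way around the lost meta attributes is to record which document each piece came from while flattening. The following is only a minimal base R sketch of that bookkeeping, not a tm-based implementation: `reshape_with_origin` and `naive_split_sentences` are hypothetical names, the input is a plain character vector rather than a corpus, and the regex splitter is a crude stand-in for the OpenNLP sentence detector.

```r
# Hypothetical helper: naive sentence splitter (NOT the OpenNLP Maxent detector).
# Splits on whitespace that follows '.', '!' or '?'.
naive_split_sentences <- function(text) {
  parts <- unlist(strsplit(text, "(?<=[.!?])\\s+", perl = TRUE))
  parts[nzchar(parts)]
}

# Hypothetical helper: apply FUN to each document, flatten the results,
# and keep the index of the document each piece came from.
reshape_with_origin <- function(docs, FUN, ...) {
  pieces <- lapply(docs, FUN, ...)
  counts <- vapply(pieces, length, integer(1))
  data.frame(
    origin = rep(seq_along(pieces), times = counts),
    text = unlist(pieces, use.names = FALSE),
    stringsAsFactors = FALSE
  )
}

docs <- c("First sentence. Second sentence.", "Only one here.")
res <- reshape_with_origin(docs, naive_split_sentences)
res$origin
# 1 1 2
```

The `origin` column could then be used to look up the meta data of the original document and copy it onto each new one.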
It works as follows:
## create a corpus
dat <- data.frame(doc1 = "Doctor Who is a British science fiction television programme produced by the BBC. The programme depicts the adventures of a Time Lord—a time travelling, humanoid alien known as the Doctor. He explores the universe in his TARDIS (acronym: Time and Relative Dimension in Space), a sentient time-travelling space ship. Its exterior appears as a blue British police box, a common sight in Britain in 1963, when the series first aired. Along with a succession of companions, the Doctor faces a variety of foes while working to save civilisations, help ordinary people, and right wrongs.",
doc2 = "The show has received recognition from critics and the public as one of the finest British television programmes, winning the 2006 British Academy Television Award for Best Drama Series and five consecutive (2005–10) awards at the National Television Awards during Russell T Davies's tenure as Executive Producer.[3][4] In 2011, Matt Smith became the first Doctor to be nominated for a BAFTA Television Award for Best Actor. In 2013, the Peabody Awards honoured Doctor Who with an Institutional Peabody \"for evolving with technology and the times like nothing else in the known television universe.\"[5]",
doc3 = "The programme is listed in Guinness World Records as the longest-running science fiction television show in the world[6] and as the \"most successful\" science fiction series of all time—based on its over-all broadcast ratings, DVD and book sales, and iTunes traffic.[7] During its original run, it was recognised for its imaginative stories, creative low-budget special effects, and pioneering use of electronic music (originally produced by the BBC Radiophonic Workshop).",
stringsAsFactors = FALSE)
current.corpus <- Corpus(VectorSource(dat))
# A corpus with 3 text documents
## reshape the corpus into sentences (modify this function if you want to keep meta data)
reshape_corpus(current.corpus, convert_text_to_sentences)
# A corpus with 10 text documents
My sessionInfo output:
> sessionInfo()
R version 3.0.1 (2013-05-16)
Platform: x86_64-w64-mingw32/x64 (64-bit)
locale:
[1] LC_COLLATE=English_United Kingdom.1252 LC_CTYPE=English_United Kingdom.1252 LC_MONETARY=English_United Kingdom.1252 LC_NUMERIC=C
[5] LC_TIME=English_United Kingdom.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] NLP_0.1-0 openNLP_0.2-1 tm_0.5-9.1
loaded via a namespace (and not attached):
[1] openNLPdata_1.5.3-1 parallel_3.0.1 rJava_0.9-4 slam_0.1-29 tools_3.0.1
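openNLP has had some major changes. The bad news is that it looks very different from how it used to. The good news is that it is more flexible, and the functionality you enjoyed before is still there; you just have to find it.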
This will give you what you are after:
?Maxent_Sent_Token_Annotator
Just work through the example on that help page and you will see the functionality you are looking for.