How to show corpus text in the R tm package?

Azr*_*ael 10 r corpus tm

I'm new to R and the tm package, so please excuse my silly question ;-) How can I show the text of a plain-text corpus in the R tm package?

I've loaded a corpus of 323 plain-text files:

library(tm)
src <- DirSource("Korpora/technologie")
corpus <- Corpus(src)

But when I access a document in the corpus with:

corpus[[1]]

I always get output like this instead of the text itself:

<<PlainTextDocument>>
Metadata:  7
Content:  chars: 144
Content:  chars: 141
Content:  chars: 224
Content:  chars: 75
Content:  chars: 105

How can I show the text of the corpus?

Thanks!

UPDATE with a reproducible sample: I have also tried the built-in sample text:

> data("crude")
> crude
<<VCorpus>>
Metadata:  corpus specific: 0, document level (indexed): 0
Content:  documents: 20
> crude[1]
<<VCorpus>>
Metadata:  corpus specific: 0, document level (indexed): 0
Content:  documents: 1
> crude[[1]]
<<PlainTextDocument>>
Metadata:  15
Content:  chars: 527

How can I print the text of the documents?

UPDATE 2: session info:

> sessionInfo()
R version 3.1.3 (2015-03-09)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

locale:
[1] LC_COLLATE=German_Germany.1252  LC_CTYPE=German_Germany.1252   
[3] LC_MONETARY=German_Germany.1252 LC_NUMERIC=C                   
[5] LC_TIME=German_Germany.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] tm_0.6-1  NLP_0.1-7

loaded via a namespace (and not attached):
[1] parallel_3.1.3 slam_0.1-32    tools_3.1.3   

sil*_*ilo 36

This works for me with the latest version of tm to print the content text:

corpus[[1]]$content

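Note: Ricky suggested more or less the same thing in an earlier comment. Sorry, I wanted to post this as a comment, but my reputation is only 25 (a minimum of 50 is needed to comment).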
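
For reference, here is a minimal sketch of a few equivalent ways to get at the text, using the built-in "crude" corpus rather than the asker's files. It assumes tm 0.6 or later, where a PlainTextDocument stores its text in a `content` element; the variable names `doc` and `txt` are just for illustration.

```r
library(tm)

# Built-in sample corpus shipped with tm
data("crude")

doc <- crude[[1]]

# content() is the accessor for the document text;
# doc$content reads the same element directly
txt <- content(doc)
identical(txt, doc$content)

# as.character() on a PlainTextDocument also returns the text
identical(txt, as.character(doc))

# writeLines() prints the text with real line breaks instead of "\n" escapes
writeLines(txt)
```

Using `content()` or `as.character()` rather than `$content` avoids depending on how the document object is laid out internally.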


Ana*_*onk 11

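You can try converting the corpus text into a data frame and accessing the desired text from the data frame itself. I'm using the built-in sample data "crude" (from the tm package) as an example.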

library(tm)
data("crude")
dataframe <- data.frame(text = unlist(sapply(crude, `[`, "content")), stringsAsFactors = FALSE)

dataframe[1,]
[1] "Diamond Shamrock Corp said that\neffective today it had cut its contract prices for crude oil by\n1.50 dlrs a barrel.\n    The reduction brings its posted price for West Texas\nIntermediate to 16.00 dlrs a barrel, the copany said.\n    \"The price reduction today was made in the light of falling\noil product prices and a weak crude oil market,\" a company\nspokeswoman said.\n    Diamond is the latest in a line of U.S. oil companies that\nhave cut its contract, or posted, prices over the last two days\nciting weak oil markets.\n Reuter"

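If a document's content has several lines (as in the question, where one document shows several "Content: chars" entries), `unlist()` would produce one row per line rather than one row per document. A possible variation, sketched here under the assumption that one row per document is what you want, collapses each document to a single string first. The names `texts` and `docs` are illustrative only.

```r
library(tm)
data("crude")

# Collapse each document to one string so every document yields exactly one row
texts <- vapply(
  crude,
  function(d) paste(as.character(d), collapse = "\n"),
  character(1)
)

docs <- data.frame(
  doc = seq_along(texts),   # position of the document in the corpus
  text = unname(texts),
  stringsAsFactors = FALSE
)

nrow(docs)      # one row per document
docs$text[1]    # text of the first document
```
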

S. *_*awi 8

Here is a simple, straightforward way to show the corpus text:

strwrap(corpus[[1]])

For the crude data, this outputs:

[1] "Diamond Shamrock Corp said that effective today it had cut its contract"      
[2] "prices for crude oil by 1.50 dlrs a barrel.  The reduction brings its posted" 
[3] "price for West Texas Intermediate to 16.00 dlrs a barrel, the copany said."   
[4] "\"The price reduction today was made in the light of falling oil product"     
[5] "prices and a weak crude oil market,\" a company spokeswoman said.  Diamond is"
[6] "the latest in a line of U.S. oil companies that have cut its contract, or"    
[7] "posted, prices over the last two days citing weak oil markets.  Reuter"
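
As a small variation on the above (a sketch, not the only way to do it), you can pass the wrapped lines to `writeLines()` to print them without the `[n]` indices and quotes. The `width = 60` value is an arbitrary choice for illustration, and "crude" stands in for your own corpus.

```r
library(tm)
data("crude")

# strwrap() returns a character vector, one element per wrapped line
wrapped <- strwrap(as.character(crude[[1]]), width = 60)

# writeLines() prints each element on its own line, unquoted
writeLines(wrapped)
```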