Whenever I run this code, the tm_map line gives me this warning message: In tm_map.SimpleCorpus(docs, toSpace, "/") : transformation drops documents
texts <- read.csv("./Data/fast food/Domino's/Domino's veg pizza.csv",stringsAsFactors = FALSE)
docs <- Corpus(VectorSource(texts))
toSpace <- content_transformer(function (x , pattern ) gsub(pattern, " ", x))
docs <- tm_map(docs, toSpace, "/")
docs <- tm_map(docs, toSpace, "@")
docs <- tm_map(docs, toSpace, "\\|")
docs <- tm_map(docs, content_transformer(tolower))
docs <- tm_map(docs, removeNumbers)
docs <- tm_map(docs, removeWords, stopwords("english"))
docs <- tm_map(docs, removeWords, c("blabla1", "blabla2"))
docs <- tm_map(docs, removePunctuation)
docs <- tm_map(docs, stripWhitespace)
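This warning only appears when you use content_transformer to create your own function, and only when your corpus is based on a VectorSource.
The reason is that the underlying code checks whether the number of names on the corpus content matches the length of the corpus content. When text is read in as a vector, the documents have no names, so the warning is raised. It is only a warning; no documents have been dropped.
See the following example: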
library(tm)
toSpace <- content_transformer(function (x , pattern ) gsub(pattern, " ", x))
text <- c("this is my text with a forward slash / and some other text")
text_corpus <- Corpus(VectorSource(text))
inspect(text_corpus)
<<SimpleCorpus>>
Metadata: corpus specific: 1, document level (indexed): 0
Content: documents: 1
[1] this is my text with a forward slash / and some other text
# warning appears here
text_corpus <- tm_map(text_corpus, toSpace, "/")
inspect(text_corpus)
<<SimpleCorpus>>
Metadata: corpus specific: 1, document level (indexed): 0
Content: documents: 1
[1] this is my text with a forward slash and some other text
You can see that text_corpus has no names with the following command:
names(content(text_corpus))
NULL
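To make the name/length comparison concrete, here is a minimal base R sketch. It is not tm's source code; drops_documents is a hypothetical helper that only mimics the kind of check described above, under the assumption that the warning fires when the two lengths differ.

```r
# Hypothetical stand-in for the check described above; not tm's actual code.
# It compares the number of names on the content with the content's length.
drops_documents <- function(content) {
  n <- names(content)            # NULL when the vector is unnamed
  length(content) != length(n)   # length(NULL) is 0, so unnamed content mismatches
}

unnamed <- c("first doc", "second doc")
named   <- c(a = "first doc", b = "second doc")

drops_documents(unnamed)  # TRUE: 2 documents, 0 names
drops_documents(named)    # FALSE: 2 documents, 2 names
```

An unnamed vector trips the comparison even though every document is still present, which matches the observation that nothing is actually dropped.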
If you don't want this warning to appear, create a data.frame and use it as a DataframeSource.
text <- c("this is my text with a forward slash / and some other text")
doc_ids <- c(1)
df <- data.frame(doc_id = doc_ids, text = text, stringsAsFactors = FALSE)
df_corpus <- Corpus(DataframeSource(df))
inspect(df_corpus)
# no warning appears
df_corpus <- tm_map(df_corpus, toSpace, "/")
inspect(df_corpus)
names(content(df_corpus))
"1"
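The same approach could be applied to text read from a CSV file, as in the question. The sketch below is only an illustration: the column name review is an assumption, since the actual columns of the CSV are not shown, and a small inline data.frame stands in for the result of read.csv.

```r
library(tm)

# Hypothetical stand-in for the data read with read.csv();
# 'review' is an assumed column name, not taken from the question.
texts <- data.frame(
  review = c("great pizza / fast delivery", "cold @ arrival"),
  stringsAsFactors = FALSE
)

# DataframeSource expects the columns 'doc_id' and 'text'.
df <- data.frame(
  doc_id = seq_len(nrow(texts)),
  text = texts$review,
  stringsAsFactors = FALSE
)

docs <- Corpus(DataframeSource(df))

toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
docs <- tm_map(docs, toSpace, "/")
docs <- tm_map(docs, toSpace, "@")

# Should list the doc_id values rather than NULL
names(content(docs))
```

Because each document now carries a name taken from doc_id, the name count and the content length agree, so the warning should not be raised.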