che*_*kok 5 spell-checking r hunspell
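I am currently working with a large data frame in which each row contains a lot of text, and I would like to use the hunspell package to efficiently identify and replace the misspelled words in each sentence. I was able to identify the misspelled words, but I can't figure out how to run hunspell_suggest on a list.
Here is an example of the data frame: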
df1 <- data.frame("Index" = 1:7, "Text" = c("A complec sentence joins an independet",
"Mary and Samantha arived at the bus staton before noon",
"I did not see thm at the station in the mrning",
"The participnts read 60 sentences in radom order",
"how to fix mispelled words in R languge",
"today is Tuesday",
"bing sports quiz"))
I converted the text column to character and used hunspell to identify the misspelled words in each row.
library(hunspell)
df1$Text <- as.character(df1$Text)
df1$word_check <- hunspell(df1$Text)
I tried
df1$suggest <- hunspell_suggest(df1$word_check)
but it keeps giving this error:
Error in hunspell_suggest(df1$word_check) :
is.character(words) is not TRUE
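I'm new to this, so I'm not sure what a suggestion column built with hunspell_suggest should look like. Any help would be greatly appreciated.
Check your intermediate steps. The output of df1$word_check looks like this: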
List of 5
$ : chr [1:2] "complec" "independet"
$ : chr [1:2] "arived" "staton"
$ : chr [1:2] "thm" "mrning"
$ : chr [1:2] "participnts" "radom"
$ : chr [1:2] "mispelled" "languge"
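It is of type list, whereas hunspell_suggest expects a character vector, which is why the is.character(words) check fails. If you call lapply(df1$word_check, hunspell_suggest) instead, you get the suggestions.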
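As a minimal sketch of the lapply approach, assuming the hunspell package and its default English dictionary are installed, the suggestions can be stored as a list column next to the misspelled words. The two-row data frame df here is only a stand-in for df1:

```r
library(hunspell)

# Small stand-in for df1 (two of the example sentences from the question)
df <- data.frame(Text = c("I did not see thm at the station in the mrning",
                          "today is Tuesday"),
                 stringsAsFactors = FALSE)

# hunspell() returns a list: one character vector of misspelled words per row
df$word_check <- hunspell(df$Text)

# Apply hunspell_suggest() to each character vector in turn.
# The result is a list of lists: per row, one vector of candidates per word.
df$suggest <- lapply(df$word_check, hunspell_suggest)

str(df$suggest)
```

Each element of df$suggest holds one vector of candidate corrections per misspelled word, so a row without misspellings gets an empty list.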
Edit
I decided to look into this in more detail, since I didn't see any simple alternative. This is what I came up with:
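cleantext = function(x){
  # Correct one string at a time; <<- updates x in cleantext's own environment
  sapply(seq_along(x), function(y){
    bad = hunspell(x[y])[[1]]  # misspelled words in this string
    # first suggestion per word (errors if a word has no suggestions at all)
    good = unlist(lapply(hunspell_suggest(bad), `[[`, 1))
    for (i in seq_along(bad)){
      x[y] <<- gsub(bad[i], good[i], x[y])
    }})
  x
}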
There may well be a more elegant way, but this function returns a vector of corrected strings, like so:
> df1$Text
[1] "A complec sentence joins an independet"
[2] "Mary and Samantha arived at the bus staton before noon"
[3] "I did not see thm at the station in the mrning"
[4] "The participnts read 60 sentences in radom order"
[5] "how to fix mispelled words in R languge"
[6] "today is Tuesday"
[7] "bing sports quiz"
> cleantext(df1$Text)
[1] "A complex sentence joins an independent"
[2] "Mary and Samantha rived at the bus station before noon"
[3] "I did not see them at the station in the morning"
[4] "The participants read 60 sentences in radon order"
[5] "how to fix misspelled words in R language"
[6] "today is Tuesday"
[7] "bung sports quiz"
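Note that this uses the first suggestion given by hunspell, which may or may not be correct. In the output above, for example, "arived" became "rived", "radom" became "radon", and "bing" became "bung".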
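One possible refinement, sketched here as an illustration rather than a definitive fix: replace only whole words and skip any word for which hunspell offers no suggestion. The name correct_text is hypothetical, and the sketch assumes the misspelled words contain no regex metacharacters.

```r
library(hunspell)

# Hypothetical helper: replace each misspelled word with hunspell's first
# suggestion; words without any suggestion are left unchanged.
correct_text <- function(x) {
  vapply(x, function(s) {
    bad <- unique(hunspell(s)[[1]])  # misspelled words in this string
    for (word in bad) {
      candidates <- hunspell_suggest(word)[[1]]
      if (length(candidates) > 0) {
        # \\b restricts the match to whole words (assumes `word` has no
        # regex metacharacters)
        s <- gsub(paste0("\\b", word, "\\b"), candidates[1], s)
      }
    }
    s
  }, character(1), USE.NAMES = FALSE)
}

df1_text <- c("I did not see thm at the station in the mrning",
              "today is Tuesday")
correct_text(df1_text)
```

The word boundary matters because a plain gsub on "thm" would also rewrite the inside of a correctly spelled word such as "rhythm". The first suggestion can still be the wrong word, so it may be worth reviewing the candidates before replacing anything.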