Spell-checking and stemming in R
I have about 1.4 million documents, each with a character count of roughly 250 (median) to 470 (mean).
I want to perform spell checking and stemming on them before classification.
A mock document:
sentence <- "We aree drivng as fast as we drove yestrday or evven fastter zysxzw" %>%
rep(times = 6) %>%
paste(collapse = " ")
nchar(sentence)
[1] 407
A function that first performs spell checking and then stemming:
library(hunspell)
library(magrittr)

spellAndStem <- function(sent, language = "en_US") {
  words <- sent %>%
    strsplit(split = " ") %>%
    unlist()

  # spelling: check every word and replace misspellings with the first suggestion
  correct <- hunspell_check(
    words = words,
    dict = dictionary(language)
  )

  words[!correct] %<>%
    hunspell_suggest(dict = language) %>%
    sapply(FUN = "[", 1)

  # stemming: stem every word and glue the document back together
  words %>%
    hunspell_stem(dict = dictionary(language)) %>%
    unlist() %>%
    paste(collapse = " ")
}
I noticed that the hunspell() function, which processes a document as a whole, offers a performance gain, but I don't see how I could do the spell checking and then the stemming in that sequence with it.
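For reference, a minimal sketch of what hunspell() returns on the mock document: only the misspelled words per text (which is where the speed-up comes from), not a corrected document, so the replacement and stemming steps still have to be wired up separately. The word list shown in the comment is indicative only.

library(hunspell)

# hunspell() scans each document as a whole and returns one character
# vector of misspelled words per document
bad <- hunspell(sentence, dict = dictionary("en_US"))
unique(bad[[1]])
# likely something like: "aree" "drivng" "yestrday" "evven" "fastter" "zysxzw"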
Timing:
> library(microbenchmark)
> microbenchmark(spellAndStem(sentence), times = 100)
Unit: milliseconds
                   expr      min       lq     mean   median       uq      max neval
 spellAndStem(sentence) 680.3601 689.8842 700.7957 694.3781 702.7493 798.9544   100
At 0.7 seconds per document, the computation would take 0.7 * 1,400,000 / 3600 / 24 = 11.3 days.
Question:
How can I optimize this for performance?
Final remark:
The target language is 98% German and 2% English. Not sure whether that matters, but mentioning it for completeness.
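For completeness, a minimal sketch of how a German dictionary would plug into the function above, assuming the de_DE dictionary files are installed on the system (the hunspell package itself ships only English dictionaries):

library(hunspell)

# hunspell bundles only English dictionaries; other languages have to be
# installed separately and be on hunspell's dictionary search path
list_dictionaries()                            # lists the dictionaries hunspell can find
# spellAndStem(some_german_doc, language = "de_DE")   # assumes de_DE.dic / de_DE.aff are available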
You can optimize your code significantly by performing the expensive steps on the vocabulary rather than on every word in the document. The quanteda package offers a very useful object class for this, called tokens:
toks <- quanteda::tokens(sentence)
unclass(toks)
#> $text1
#> [1] 1 2 3 4 5 4 6 7 8 9 10 11 12 1 2 3 4 5 4 6 7 8 9 10 11
#> [26] 12 1 2 3 4 5 4 6 7 8 9 10 11 12 1 2 3 4 5 4 6 7 8 9 10
#> [51] 11 12 1 2 3 4 5 4 6 7 8 9 10 11 12 1 2 3 4 5 4 6 7 8 9
#> [76] 10 11 12
#>
#> attr(,"types")
#> [1] "We" "aree" "drivng" "as" "fast" "we"
#> [7] "drove" "yestrday" "or" "evven" "fastter" "zysxzw"
#> attr(,"padding")
#> [1] FALSE
#> attr(,"what")
#> [1] "word"
#> attr(,"ngrams")
#> [1] 1
#> attr(,"skip")
#> [1] 0
#> attr(,"concatenator")
#> [1] "_"
#> attr(,"docvars")
#> data frame with 0 columns and 1 row
As you can see, the text is split into its vocabulary (types) and the positions of the words. We can use this to optimize your code by performing all the steps on the types only, rather than on the entire text:
spellAndStem_tokens <- function(sent, language = "en_US") {
  sent_t <- quanteda::tokens(sent)

  # extract types to only work on them
  types <- quanteda::types(sent_t)

  # spelling
  correct <- hunspell_check(
    words = as.character(types),
    dict = hunspell::dictionary(language)
  )

  pattern <- types[!correct]
  replacement <- sapply(hunspell_suggest(pattern, dict = language), FUN = "[", 1)

  types <- stringi::stri_replace_all_fixed(
    types,
    pattern,
    replacement,
    vectorize_all = FALSE
  )

  # stemming
  types <- hunspell_stem(types, dict = dictionary(language))

  # replace original tokens with their corrected and stemmed forms
  sent_t_new <- quanteda::tokens_replace(sent_t, quanteda::types(sent_t), as.character(types))
  sent_t_new <- quanteda::tokens_remove(sent_t_new, pattern = "NULL", valuetype = "fixed")

  paste(as.character(sent_t_new), collapse = " ")
}
I am using the bench package for the benchmark, since it also checks whether the results of the two functions are identical, and I generally find it more convenient:
res <- bench::mark(
  spellAndStem(sentence),
  spellAndStem_tokens(sentence)
)

res
#> # A tibble: 2 x 6
#>   expression                         min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                    <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 spellAndStem(sentence)           807ms    807ms      1.24     259KB        0
#> 2 spellAndStem_tokens(sentence)    148ms    150ms      6.61     289KB        0

summary(res, relative = TRUE)
#> # A tibble: 2 x 6
#>   expression                      min median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                    <dbl>  <dbl>     <dbl>     <dbl>    <dbl>
#> 1 spellAndStem(sentence)         5.44   5.37      1         1        NaN
#> 2 spellAndStem_tokens(sentence)  1      1         5.33      1.11     NaN
The new function is 5.44 times faster than the original one. Note that the difference becomes more pronounced the larger the input text is:
sentence <- "We aree drivng as fast as we drove yestrday or evven fastter zysxzw" %>%
rep(times = 600) %>%
paste(collapse = " ")
res_big <- bench::mark(
spellAndStem(sentence),
spellAndStem_tokens(sentence)
)
res_big
#> # A tibble: 2 x 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 spellAndStem(sentence) 1.27m 1.27m 0.0131 749.81KB 0
#> 2 spellAndStem_tokens(sentence) 178.26ms 182.12ms 5.51 1.94MB 0
summary(res_big, relative = TRUE)
#> # A tibble: 2 x 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 spellAndStem(sentence) 428. 419. 1 1 NaN
#> 2 spellAndStem_tokens(sentence) 1 1 420. 2.65 NaN
As you can see, processing the 100-times-larger sample takes almost the same time as the smaller one. This is because the vocabulary is exactly the same in both. Assuming this larger sample represents 100 of your documents, we can extrapolate from the result to your entire dataset: the function should need less than an hour (0.17826 * 14,000 / 3600 = 0.69). The calculation is not exact, though, since the actual time needed on your real data depends almost entirely on the size of the vocabulary.
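Since vocabulary size is the real cost driver, you can get a feel for it up front. A minimal sketch, assuming docs is a hypothetical character vector holding your real documents:

library(quanteda)

# `docs` stands in for the character vector of your 1.4 million documents (hypothetical)
toks_all <- tokens(docs)
length(types(toks_all))   # number of distinct word types across the whole corpus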
Besides the programming/performance side, I have a few remarks which may not apply to your specific case:
- The function currently collapses everything into one long string. If you want to keep the documents separate, you could instead return sapply(as.list(sent_t_new), paste, collapse = " "), since this does not collapse all documents into one long string but keeps them apart (see the sketch after this list).
- Some misspelled words are dropped because hunspell can't find any suggestion for them. I copied that approach (see the tokens_remove command), but you might want to at least output the discarded words instead of removing them silently.
- Since your documents are mostly German, you might prefer lemmatization (e.g., with spacyr) or simply turn stemming off, because stemming rarely improves results for German.
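A minimal sketch of that per-document variant, assuming docs is again a hypothetical character vector with several documents; spellAndStem_docs is a made-up name for the adjusted function, which only changes the final line of spellAndStem_tokens:

library(hunspell)
library(quanteda)
library(stringi)

# variant of spellAndStem_tokens() that returns one corrected and stemmed
# string per input document instead of one concatenated string
spellAndStem_docs <- function(docs, language = "en_US") {
  sent_t <- tokens(docs)
  types  <- types(sent_t)

  # spelling: correct only the unique types
  correct     <- hunspell_check(as.character(types), dict = dictionary(language))
  pattern     <- types[!correct]
  replacement <- sapply(hunspell_suggest(pattern, dict = language), "[", 1)
  types       <- stri_replace_all_fixed(types, pattern, replacement, vectorize_all = FALSE)

  # stemming, also on the unique types only
  types <- hunspell_stem(types, dict = dictionary(language))

  # map corrected/stemmed types back onto the token positions
  sent_t_new <- tokens_replace(sent_t, types(sent_t), as.character(types))
  sent_t_new <- tokens_remove(sent_t_new, pattern = "NULL", valuetype = "fixed")

  # keep documents separate: one output string per input document
  sapply(as.list(sent_t_new), paste, collapse = " ")
}

# hypothetical usage with two small documents
docs <- c("We aree drivng fastter", "He drove yestrday")
spellAndStem_docs(docs)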