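I have some survey data in which the item names are the survey text with the spaces removed. I would like to add the spaces back in. Obviously, this requires some knowledge of English.
Here is some sample data, but any function should work for arbitrary reasonable sentences: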
x <- c("Shewrotehimalongletter,buthedidn'treadit.",
"Theshootersaysgoodbyetohislove.",
"WritingalistofrandomsentencesisharderthanIinitiallythoughtitwouldbe.",
"Letmehelpyouwithyourbaggage.",
"Pleasewaitoutsideofthehouse.",
"Iwantmoredetailedinformation.",
"Theskyisclear;thestarsaretwinkling.",
"Sometimes,allyouneedtodoiscompletelymakeanassofyourselfandlaughitofftorealisethatlifeisn’tsobadafterall.")
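This is an answer, but more of a "there may not be a unique answer" answer.
The `ScrabbleScore` package contains the 2006 tournament word list, so I am using that as an approximation of the "English words" I search for.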
library(ScrabbleScore)
data("twl06")
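We can check whether a word is "English" by looking it up in that list.
# Return the string itself if it is in the word list, otherwise the sentinel 1.
findword <- function(string) {
  if (string %in% twl06) return(string) else return(1)
}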
Let's use a nicely ambiguous piece of text, shall we? This one caused a bit of a stir because it was used as the hashtag for Susan Boyle's album party:
x <- c("susanalbumparty")
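We can check substrings for "English" words and progressively shorten the string each time a word is found. This can be done from the start or from the end, so I will do both to demonstrate that the answer is hardly unique.
sentence_splitter <- function(x) {
  # Pass 1: search from the end. Trim characters off the front of the
  # candidate y until it is a word, then cut that word off the end of z.
  z <- y <- x
  words1 <- list()
  while (nchar(z) > 1) {
    while (findword(y) == 1 && nchar(y) > 1) {
      y <- substr(y, 2, nchar(y))
    }
    if (findword(y) != 1) words1 <- append(words1, y)
    y <- z <- substr(z, 1, nchar(z) - nchar(y))
  }
  # Pass 2: search from the start. Trim characters off the end of the
  # candidate y until it is a word, then cut that word off the front of z.
  z <- y <- x
  words2 <- list()
  while (nchar(z) > 1) {
    while (findword(y) == 1 && nchar(y) > 1) {
      y <- substr(y, 1, nchar(y) - 1)
    }
    if (findword(y) != 1) words2 <- append(words2, y)
    y <- z <- substr(z, 1 + nchar(y), nchar(z))
  }
  # Pass 1 collected its words back to front, so reverse them before pasting.
  return(list(paste(unlist(rev(words1)), collapse = " "),
              paste(unlist(words2), collapse = " ")))
}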
Result:
sentence_splitter("susanalbumparty")
#> [[1]]
#> [1] "us an album party"
#>
#> [[2]]
#> [1] "us anal bump arty"
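Note: this finds the longest substring in each search direction (because I am shortening the string). You could also find the shortest by extending the string instead. To do this properly, you would need to look at all "English" substrings that leave only valid words behind.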
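The "look at all substrings that leave only valid words" idea from the note above could be sketched roughly as follows. This is only an illustration, not part of the original answer: `all_splits` is a hypothetical helper name, and `toy_words` is a hand-picked toy word list (not taken from `twl06`) chosen so that the example is ambiguous. It only returns splits that cover the whole string, and it can get slow on long inputs because the number of candidate splits can grow very quickly.

```r
# Hypothetical toy word list standing in for a real dictionary such as twl06.
toy_words <- c("sus", "susan", "an", "anal", "album", "bump", "party", "arty")

# Return every way of splitting `s` completely into words from `dict`.
# Each element of the result is one space-separated candidate sentence;
# the result is character(0) if no complete split exists.
all_splits <- function(s, dict) {
  n <- nchar(s)
  if (n == 0) return("")  # nothing left to split: one (empty) solution
  out <- character(0)
  for (i in seq_len(n)) {
    first <- substr(s, 1, i)
    if (first %in% dict) {
      rest <- all_splits(substr(s, i + 1, n), dict)
      # Keep this prefix only if the remainder can itself be fully split.
      if (length(rest) > 0) out <- c(out, trimws(paste(first, rest)))
    }
  }
  out
}

all_splits("susanalbumparty", toy_words)
```

With a real dictionary in place of `toy_words`, you would still need some way of ranking the candidates (word frequencies, for example) to pick the most plausible sentence.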
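Finally, you will notice that 'susan' is not matched, since it is not a "valid English word" under this definition.
Hopefully that is enough to convince you that this is not going to be simple.
Update: trying this on some of your examples (once you `tolower` them and remove the punctuation, it is actually not too bad)... the last one is a doozy, but the rest seem okay: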
unlist(lapply(sub("[[:punct:]]", "", tolower(x))[1:7], sentence_splitter))
#> "she wrote him along letter the did re adit"
#> "shew rote him along letter but he did tread it"
#> "the shooter says goodbye to his love"
#> "the shooters ays goodbye to his love"
#> "writing alist of random sentence sis harder ani initially though tit would be"
#> "writing alist of randoms en ten es is harder than initially thought it would be"
#> "let me help you with your baggage"
#> "let me help you withy our baggage"
#> "please wait outside of the house"
#> "please wait outside oft heh use"
#> "want more detailed information"
#> "want more detailed information"
#> "the sky is clear the stars are twinkling"
#> "the sky is clear the stars are twinkling"
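One thing to be aware of in the call above: `sub` replaces only the first match in each string, so the output shown was produced with just the first punctuation mark removed. If you want every punctuation mark stripped before splitting, `gsub` is the usual choice. A minimal sketch, where `strip_punct` is a hypothetical helper name and not part of the original answer:

```r
# Hypothetical helper: lower-case the input and remove ALL punctuation.
strip_punct <- function(x) gsub("[[:punct:]]", "", tolower(x))

s <- "Shewrotehimalongletter,buthedidn'treadit."

# sub() removes only the first punctuation mark (the comma here).
sub("[[:punct:]]", "", tolower(s))

# gsub() removes the comma, the apostrophe and the full stop.
strip_punct(s)
```

The cleaned strings can then be passed to `sentence_splitter` in the same way as before; the splits may differ from those shown above because the input differs.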