ale*_*lex 6 regex r text-mining strsplit
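I want to split text in sentences into a vector of character elements. Several patterns act as split criteria ("and/ERT", "/$"). There are also exceptions to these patterns (":/$.", "and/ERT then", "./$. Smiley").
Attempt: match the cases where a split should occur, insert an unusual marker ("^&*") at that position, then strsplit on that marker.
Problem: I don't know how to handle the exceptions properly. In certain explicit cases the marker ("^&*") should be removed again and the original text restored before running strsplit.
Code: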
text <- c("This are faulty propositions one and/ERT two ,/$, which I want to split ./$. There are cases where I explicitly want and/ERT some where I don't want to split ./$. For example :/$. when there is an and/ERT then I don't want to split ./$. This is also one case where I dont't want to split ./$. Smiley !/$. Thank you ./$!",
"This are the same faulty propositions one and/ERT two ,/$, which I want to split ./$. There are cases where I explicitly want and/ERT some where I don't want to split ./$. For example :/$. when there is an and/ERT then I don't want to split ./$. This is also one case where I dont't want to split ./$. Smiley !/$. Thank you ./$!",
"Like above the same faulty propositions one and/ERT two ,/$, which I want to split ./$. There are cases where I explicitly want and/ERT some where I don't want to split ./$. For example :/$. when there is an and/ERT then I don't want to split ./$. This is also one case where I dont't want to split ./$. Smiley !/$. Thank you ./$!")
patternSplit <- c("and/ERT", "/\\$") # The class of split cases is much larger than in this example, so it is not possible to address them all explicitly.
patternSplit <- paste("(", paste(patternSplit, collapse = "|"), ")", sep = "")
exceptionsSplit <- c("\\:/\\$\\.", "and/ERT then", "\\./\\$\\. Smiley")
exceptionsSplit <- paste("(", paste(exceptionsSplit, collapse = "|"), ")", sep = "")
# If you don't have exceptions, it works here. Unfortunately it splits "*$/*" into "*" and "$/*". Would be convenient to avoid this. See example "ideal" split below.
textsplitted <- strsplit(gsub(patternSplit, "^&*\\1", text), "^&*", fixed = TRUE) #
# Ideal split:
textsplitted
> textsplitted
[[1]]
[1] "This are faulty propositions one and/ERT"
[2] "two ,/$,"
[3] "which I want to split ./$."
[4] "There are cases where I explicitly want and/ERT"
[5] "some where I don't want to split ./$."
[6] "For example :/$. when there is an and/ERT then I don't want to split ./$."
[7] "This is also one case where I dont't want to split ./$. Smiley !/$."
[8] "Thank you ./$!"
[[2]]
[1] "This are the same faulty propositions one and/ERT"
[2] "two ,/$,"
#...
# This attempt doesn't work!
text <- gsub(patternSplit, "^&*\\1", text)
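# Pseudocode: each matched exception should be replaced by its original text without the "^&*" marker
# text <- gsub(exceptionsSplit, <original text without "^&*">, text)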
textsplitted <- strsplit(text, "^&*", fixed = TRUE)
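I think you can use this expression to get the splits you want. Because strsplit consumes the characters it splits on, you have to split on the space that follows the things you want to match / not match (which is what you have in the desired output in the OP):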
strsplit( text[[1]] , "(?<=and/ERT)\\s(?!then)|(?<=/\\$[[:punct:]])(?<!:/\\$[[:punct:]])\\s(?!Smiley)" , perl = TRUE )
#[[1]]
#[1] "This are faulty propositions one and/ERT"
#[2] "two ,/$,"
#[3] "which I want to split ./$."
#[4] "There are cases where I explicitly want and/ERT"
#[5] "some where I don't want to split ./$."
#[6] "For example :/$. when there is an and/ERT then I don't want to split ./$."
#[7] "This is also one case where I dont't want to split ./$. Smiley !/$."
#[8] "Thank you ./$!"
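Since strsplit is vectorised over its first argument, the same pattern can be applied to a whole character vector rather than only to text[[1]]. A minimal sketch, assuming two shortened, made-up inputs (txt and splitRegex are illustrative names, not part of the original post):

```r
# Hypothetical shortened inputs that mimic the tagging scheme of the question
txt <- c("one and/ERT two ,/$, three ./$.",
         "For example :/$. an and/ERT then case ./$. Smiley !/$. Thank you ./$!")

# Same expression as above, stored once so it can be reused
splitRegex <- "(?<=and/ERT)\\s(?!then)|(?<=/\\$[[:punct:]])(?<!:/\\$[[:punct:]])\\s(?!Smiley)"

# One list element per input string
res <- strsplit(txt, splitRegex, perl = TRUE)

res[[1]]
# [1] "one and/ERT" "two ,/$,"    "three ./$."
res[[2]]
# [1] "For example :/$. an and/ERT then case ./$. Smiley !/$."
# [2] "Thank you ./$!"
```

The result is a list with one character vector per input element, so lengths(res) gives the number of pieces per sentence.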
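(?<=and/ERT)\\s - split on a space, \\s, that IS preceded, (?<=...), by "and/ERT"
(?!then) - BUT only if that space is NOT followed, (?!...), by "then"
| - OR operator to chain the next expression
(?<=/\\$[[:punct:]]) - positive look-behind assertion for "/$" followed by any punctuation character
(?<!:/\\$[[:punct:]])\\s(?!Smiley) - match a space that is NOT preceded by ":/$" plus a punctuation character (but IS preceded by "/$" plus a punctuation character, per the previous point) and is NOT followed, (?!...), by "Smiley"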
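If the list of exceptions grows and the lookarounds become hard to maintain, the marker idea from the question can probably be made to work as well. The sketch below is one possible variant, not the approach of the answer above: it replaces the space after each split token with the marker, then restores every exception by undoing the marker inside it. The input string, the names marker, splitAfter, exceptions, and the assumption that "^&*" never occurs in the real text are all illustrative.

```r
# Hypothetical shortened input that mimics the tagging scheme of the question
txt <- "one and/ERT two ,/$, For example :/$. an and/ERT then case ./$. Smiley !/$. Thank you ./$!"

marker <- "^&*"  # assumed never to occur in the text itself

# Split tokens followed by a space; the token is kept, the space becomes the marker
splitAfter <- "(and/ERT|/\\$[[:punct:]])\\s"
marked <- gsub(splitAfter, paste0("\\1", marker), txt, perl = TRUE)

# Exceptions written as plain text, exactly as they appear before marking
exceptions <- c(":/$. ", "and/ERT then", "./$. Smiley")

for (ex in exceptions) {
  # Work out what the exception looks like after marking ...
  exMarked <- gsub(splitAfter, paste0("\\1", marker), ex, perl = TRUE)
  # ... and put the original text back wherever that marked form occurs
  marked <- gsub(exMarked, ex, marked, fixed = TRUE)
}

pieces <- strsplit(marked, marker, fixed = TRUE)[[1]]
pieces
# [1] "one and/ERT"
# [2] "two ,/$,"
# [3] "For example :/$. an and/ERT then case ./$. Smiley !/$."
# [4] "Thank you ./$!"
```

Using fixed = TRUE for the restore step and for strsplit avoids having to escape "^", "*", "$" and "." by hand; the trade-off is that each exception must be listed as literal text rather than as a regular expression.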