ale*_*lex 2 regex r text-mining strsplit
How do you extract only the "/" followed by capital letters, and the whole "[[:punct:]]/$[[:punct:]]"?
text <- c("This/ART ,/$; Is/NN something something/else A/VAFIN faulty/ADV text/ADV which/ADJD i/PWS propose/ADV as/APPR Example/NE ./$. So/NE It/PTKNEG makes/ADJD no/VAFIN sense/ADV at/KOUS all/PDAT ,/$, it/APPR Has/ADJA Errors/NN ,/$; and/APPR it/APPR is/CARD senseless/NN again/ART ./$:")
# HOW to?
textPOS <- strsplit(text,"( )|(?<=[[:punct:]]/\\$[[:punct:]])", perl=TRUE)
# ^^^
# extract only the "/" with the following capital letters
# and the whole "[[:punct:]]/$[[:punct:]]"
# Expected RETURN:
> textPOS
[1] "/ART" ",/$;" "/NN" "/VAFIN" "/ADV" "/ADV" "/ADJD" "/PWS" "/ADV" "/APPR" "/NE" "./$." "/NE" "/PTKNEG" "/ADJD" "/VAFIN" "/ADV" "/KOUS" "/PDAT" ",/$," "/APPR" "/ADJA" "/NN" ",/$;" "/APPR" "/APPR" "/CARD" "/NN" "/ART" "./$:"
Thanks! :)
You can use gregexpr and regmatches:
regmatches(text, gregexpr('[[:punct:]]*/[[:alpha:][:punct:]]*', text))
# [[1]]
# [1] "/ART" ",/$;" "/NN" "/else" "/VAFIN" "/ADV" "/ADV" "/ADJD" "/PWS" "/ADV" "/APPR" "/NE"
# [13] "./$." "/NE" "/PTKNEG" "/ADJD" "/VAFIN" "/ADV" "/KOUS" "/PDAT" ",/$," "/APPR" "/ADJA" "/NN"
# [25] ",/$;" "/APPR" "/APPR" "/CARD" "/NN" "/ART" "./$:"
In words, the regex says: "find things that start with zero or more punctuation marks, followed by a slash, followed by zero or more letters or punctuation marks". If you want to include numbers, switch [:alpha:] to [:alnum:].
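To make the difference between [:alpha:] and [:alnum:] concrete, here is a minimal sketch on a made-up string. The name sample_text and the tag CARD2 are invented for illustration and do not come from the question:

```r
# Hypothetical sample: "CARD2" is an invented tag that contains a digit
sample_text <- "Das/ART Haus/NN 3/CARD2 ./$."

# Letters and punctuation only: the match stops in front of the digit
alpha_tags <- regmatches(sample_text,
                         gregexpr('[[:punct:]]*/[[:alpha:][:punct:]]*', sample_text))[[1]]
alpha_tags
# [1] "/ART"  "/NN"   "/CARD" "./$."

# Letters, digits and punctuation: the digit becomes part of the tag
alnum_tags <- regmatches(sample_text,
                         gregexpr('[[:punct:]]*/[[:alnum:][:punct:]]*', sample_text))[[1]]
alnum_tags
# [1] "/ART"   "/NN"    "/CARD2" "./$."
```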
Per comments, if you want only uppercase letters the regex would become:
regmatches(text, gregexpr('[[:punct:]]*/[[:upper:][:punct:]]*', text))
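One caveat with the uppercase version, sketched on a fragment of the question's text: because the trailing class is quantified with *, a slash followed by lowercase letters (as in something/else) can still match as a bare "/". Assuming that is not wanted, changing the final * to + is one possible way to require at least one uppercase letter or punctuation mark after the slash. The variable names below are hypothetical:

```r
# Fragment of the question's text that contains a lowercase "tag"
fragment <- "something something/else A/VAFIN"

# Trailing '*' allows zero characters after the slash, so "/" alone matches
with_star <- regmatches(fragment,
                        gregexpr('[[:punct:]]*/[[:upper:][:punct:]]*', fragment))[[1]]
with_star
# [1] "/"      "/VAFIN"

# Trailing '+' requires at least one uppercase letter or punctuation mark
with_plus <- regmatches(fragment,
                        gregexpr('[[:punct:]]*/[[:upper:][:punct:]]+', fragment))[[1]]
with_plus
# [1] "/VAFIN"
```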
As @eddi suggested, [A-Z] and [:upper:] are roughly equivalent. Also as @eddi suggested, this regex will capture the /LETTERS cases as well as the /$punct cases:
/[A-Z]+|[[:punct:]]/\\$[[:punct:]]
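As a quick sanity check, the alternation pattern can be plugged into regmatches and gregexpr on an excerpt of the question's text. This is only a sketch: the helper name extract_tags and the shortened excerpt are assumptions made for illustration.

```r
# Hypothetical helper wrapping the alternation pattern
extract_tags <- function(x) {
  pattern <- '/[A-Z]+|[[:punct:]]/\\$[[:punct:]]'
  regmatches(x, gregexpr(pattern, x))[[1]]
}

# Shortened excerpt built from tokens of the question's text
excerpt <- "This/ART ,/$; Is/NN something something/else A/VAFIN ./$."

tags <- extract_tags(excerpt)
tags
# [1] "/ART"   ",/$;"   "/NN"    "/VAFIN" "./$."
```

The lowercase something/else is skipped because the first alternative needs at least one uppercase letter after the slash, and the second alternative only matches a punctuation mark, a slash, a literal $ and another punctuation mark.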