Remove punctuation but keep emoticons?

use*_*230 10 string text r emoticons gsub

Is it possible to remove all punctuation but keep emoticons such as

:-(

:)

:D

:p

structure(list(text = structure(c(4L, 6L, 1L, 2L, 5L, 3L), .Label =     c("ãããæããããéãããæãããInappropriate announce:-(", 
"@AirAsia your direct debit (Maybank) payment gateways is not working. Is it something     you are working to fix?", 
"@AirAsia Apart from the slight delay and shortage of food on our way back from Phuket, both flights were very smooth. Kudos :)", 
"RT @AirAsia: ØØÙØÙÙÙÙ ÙØØØ ØØØÙ ÙØØØØÙ ØØØØÙÙÙí í Now you can enjoy a #great :D breakfast onboard with our new breakfast meals! :D", 
"xdek ke flight @AirAsia Malaysia to LA... hahah..:p bagi la promo murah2 sikit, kompom aku beli...", 
"You know there is a problem when customer service asks you to wait for 103 minutes and your no is 42 in the queue. X-("
), class = "factor"), created = structure(c(5L, 4L, 4L, 3L, 2L, 
1L), .Label = c("1/2/2014 16:14", "1/2/2014 17:00", "3/2/2014 0:54", 
"3/2/2014 0:58", "3/2/2014 1:28"), class = "factor")), .Names = c("text", 
"created"), class = "data.frame", row.names = c(NA, -6L))

gag*_*ews 7

1. A working pure-regex solution (a.k.a. edit #2)

This task can be done with regex alone (many thanks to @Mike Samuel).

First, we build a database of emoticons:

library(stringi)  # provides stri_paste
(emots <- as.character(outer(c(":", ";", ":-", ";-"),
               c(")", "(", "]", "[", "D", "o", "O", "P", "p"), stri_paste)))
## [1] ":)"  ";)"  ":-)" ";-)" ":("  ";("  ":-(" ";-(" ":]"  ";]"  ":-]" ";-]" ":["  ";["  ":-[" ";-[" ":D"  ";D"  ":-D" ";-D"
## [21] ":o"  ";o"  ":-o" ";-o" ":O"  ";O"  ":-O" ";-O" ":P"  ";P"  ":-P" ";-P" ":p"  ";p"  ":-p" ";-p"

Sample input text:

text <- ":) ;P :] :) ;D :( LOL :) I've been to... the (grocery) st{o}re :P :-) --- and the salesperson said: Oh boy!"

A helper function that escapes some special characters so that they can be used in a regex pattern (using the stringi package):

library(stringi)
escape_regex <- function(r) {
   stri_replace_all_regex(r, "\\(|\\)|\\[|\\]", "\\\\$0")
}

A regex that matches the emoticons:

(regex1 <- stri_c("(", stri_c(escape_regex(emots), collapse="|"), ")"))
## [1] "(:\\)|;\\)|:-\\)|;-\\)|:\\(|;\\(|:-\\(|;-\\(|:\\]|;\\]|:-\\]|;-\\]|:\\[|;\\[|:-\\[|;-\\[|:D|;D|:-D|;-D|:o|;o|:-o|;-o|:O|;O|:-O|;-O|:P|;P|:-P|;-P|:p|;p|:-p|;-p)"

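Now, as @Mike Samuel suggested below, we simply match (emoticon)|punctuation (note that the emoticon is in a capture group) and replace each match with the contents of capture group 1. If the match is an emoticon, the replacement is that emoticon; if it is a punctuation character, the replacement is nothing. This works because alternation with | in ICU Regex (the regex engine used by stri_replace_all_regex) is greedy and left-biased: the alternatives are tried from left to right, so emoticons are matched before punctuation characters.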

stri_replace_all_regex(text, stri_c(regex1, "|\\p{P}"), "$1")
## [1] ":) ;P :] :) ;D :( LOL :) Ive been to the grocery store :P :-)  and the salesperson said Oh boy"

By the way, if you only want to get rid of a selected set of characters, put e.g. [.,] in place of \\p{P} above.
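As a rough sketch of the same (emoticon)|punctuation idea without stringi, the snippet below uses base R's gsub with perl = TRUE on a whole character vector. The function name strip_punct_keep is made up for illustration, and the sample strings are shortened fragments of the tweets in the question. It quotes each emoticon with PCRE's \Q...\E syntax instead of escape_regex, and [[:punct:]] here covers ASCII punctuation rather than the full \p{P} class. Sorting the emoticons longest-first is an added precaution, not part of the answer above, in case one emoticon is a prefix of another.

```r
# Sketch only: base R analogue of the capture-group trick (names are hypothetical)
strip_punct_keep <- function(x, emoticons) {
  # Alternation is tried left to right, so put the longest emoticons first
  emoticons <- emoticons[order(nchar(emoticons), decreasing = TRUE)]
  # \Q...\E quotes each emoticon literally (PCRE syntax, hence perl = TRUE)
  emo_pattern <- paste0("\\Q", emoticons, "\\E", collapse = "|")
  pattern <- paste0("(", emo_pattern, ")|[[:punct:]]")
  # Group 1 is empty for plain punctuation, so such matches are deleted
  gsub(pattern, "\\1", x, perl = TRUE)
}

tweets <- c("Kudos :)",
            "hahah..:p bagi la promo murah2 sikit, kompom aku beli...",
            "in the queue. X-(")
cleaned <- strip_punct_keep(tweets, c(":-(", ":)", ":D", ":p", "X-("))
print(cleaned)
```

Because gsub is vectorised, the same call should work directly on a data frame column after as.character().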

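2. A hint at a regex solution - my first (unwise) attempt (a.k.a. the original answer)

My first idea (left here mainly for "historical reasons") was to tackle this with look-aheads and look-behinds but, as you will see, that is far from perfect. (The outputs shown in this section appear to come from an earlier version of the sample text, one that contained :8 and no brackets.)

To remove every : and ; that is not followed by ), P, (, D, X, 8, [, or ], use a negative look-ahead: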

stri_replace_all_regex(text, "[:;](?![)P(DX8\\[\\]])", "")
## [1] ":) :8 ;P :] :) ;D :( LOL :) I've been to... the grocery store :P -) --- and the salesperson said Oh boy!"

Now we can add some old-school emoticons (those with noses, e.g. :-), ;-D, etc.):

stri_replace_all_regex(text, "[:;](?![-]?[)P(DX8\\[\\]])", "")
## [1] ":) :8 ;P :] :) ;D :( LOL :) I've been to... the grocery store :P :-) --- and the salesperson said Oh boy!"

Now remove the hyphens (negative look-behind and look-ahead):

stri_replace_all_regex(text, "[:;](?![-]?[)P(DX8\\[\\]])|(?<![:;])[-](?![)P(DX8\\[\\]])", "")
## [1] ":) :8 ;P :] :) ;D :( LOL :) I've been to... the grocery store :P :-)  and the salesperson said Oh boy!"

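And so on. Of course, you should first build your own database of emoticons (to keep) and punctuation characters (to remove). The regex depends heavily on these two sets, so adding new emoticons is hard. It is definitely not worth applying (and may twist your brain).

3. Second attempt (friendly to regex dummies, a.k.a. edit #1)

On the other hand, if you are allergic to complicated regexes, try this. The approach has some "didactic benefits": we know exactly what is done at each of the following steps:

  1. Locate all the emoticons in text;
  2. Locate all the punctuation characters in text;
  3. Find the positions of the punctuation characters that are not part of an emoticon;
  4. Remove the characters found in step 3.

Sample input text - just 1 string - the general case is left as an exercise ;)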

text <- ":) ;P :] :) ;D :( LOL :) I've been to... the (grocery) st{o}re :P :-) --- and the salesperson said: Oh boy!"

A helper function that escapes some special characters so that they can be used in a regex:

escape_regex <- function(r) {
   library("stringi")
   stri_replace_all_regex(r, "\\(|\\)|\\[|\\]", "\\\\$0")
}

A regex that matches the emoticons:

(regex1 <- stri_c("(", stri_c(escape_regex(emots), collapse="|"), ")"))
## [1] "(:\\)|;\\)|:-\\)|;-\\)|:\\(|;\\(|:-\\(|;-\\(|:\\]|;\\]|:-\\]|;-\\]|:\\[|;\\[|:-\\[|;-\\[|:D|;D|:-D|;-D|:o|;o|:-o|;-o|:O|;O|:-O|;-O|:P|;P|:-P|;-P|:p|;p|:-p|;-p)"

Locate the start and end positions of all the emoticons (i.e. match the first OR the second OR ... emoticon):

where_emots <- stri_locate_all_regex(text, regex1)[[1]] # only for the first string of text
print(where_emots)
##       start end
##  [1,]     1   2
##  [2,]     4   5
##  [3,]     7   8
##  [4,]    10  11
##  [5,]    13  14
##  [6,]    16  17
##  [7,]    23  24
##  [8,]    64  65
##  [9,]    67  69

Locate all the punctuation characters (\\p{P} is the Unicode character class for punctuation):

where_punct <- stri_locate_all_regex(text, "\\p{P}")[[1]]
print(where_punct)
##       start end
##  [1,]     1   1
##  [2,]     2   2
##  [3,]     4   4
##  [4,]     7   7
##  [5,]     8   8
## ...
## [26,]    72  72
## [27,]    73  73
## [28,]    99  99
## [29,]   107 107

Since some punctuation characters occur inside emoticons, we should not mark those for removal:

which_punct_omit <- sapply(1:nrow(where_punct), function(i) {
   any(where_punct[i,1] >= where_emots[,1] &
        where_punct[i,2] <= where_emots[,2]) })
where_punct <- where_punct[!which_punct_omit, , drop=FALSE] # update where_punct; drop=FALSE keeps it a matrix even if only one row is left
print(where_punct)
##       start end
##  [1,]    27  27
##  [2,]    38  38
##  [3,]    39  39
##  [4,]    40  40
##  [5,]    46  46
##  [6,]    54  54
##  [7,]    58  58
##  [8,]    60  60
##  [9,]    71  71
## [10,]    72  72
## [11,]    73  73
## [12,]    99  99
## [13,]   107 107

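Each punctuation mark consists of exactly 1 character, so where_punct[,1]==where_punct[,2] always holds.

Now the final part. As you can see, where_punct[,1] contains the positions of the characters to be removed. IMHO the simplest way to do that (without loops) is to convert the string to UTF-32 (each character == 1 integer), remove the unwanted elements, and then convert back to a text representation: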

text_tmp <- stri_enc_toutf32(text)[[1]]
print(text_tmp) # here - just ASCII codes...
## [1]  58  41  32  59  80  32  58  93  32  58....
text_tmp <- text_tmp[-where_punct[,1]] # removal, but be sure that where_punct is not empty!

The result is:

stri_enc_fromutf32(text_tmp)
## [1] ":) ;P :] :) ;D :( LOL :) Ive been to the grocery store :P :-)  and the salesperson said Oh boy"

Here you are.
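For the general case (more than one string), the four steps above could be wrapped roughly as follows. This is only a sketch in base R, not the stringi code from this answer: the function name remove_punct_by_position is made up, utf8ToInt/intToUtf8 stand in for the UTF-32 round trip, and [[:punct:]] is assumed to be an acceptable substitute for \p{P} on mostly ASCII input. The length check guards against the empty-index pitfall noted in the code comment above, since x[-integer(0)] drops every element instead of none.

```r
# Sketch only: a vectorised, base R take on the position-based approach
emots <- as.character(outer(c(":", ";", ":-", ";-"),
                            c(")", "(", "]", "[", "D", "o", "O", "P", "p"), paste0))
text <- ":) ;P :] :) ;D :( LOL :) I've been to... the (grocery) st{o}re :P :-) --- and the salesperson said: Oh boy!"

remove_punct_by_position <- function(x, emoticons) {
  # \Q...\E quotes each emoticon literally (PCRE syntax, hence perl = TRUE)
  emo_pattern <- paste0("\\Q", emoticons, "\\E", collapse = "|")
  clean_one <- function(s) {
    # Step 1: character positions covered by emoticons
    emo <- gregexpr(emo_pattern, s, perl = TRUE)[[1]]
    protected <- integer(0)
    if (emo[1] != -1) {
      protected <- unlist(Map(seq, emo, emo + attr(emo, "match.length") - 1L))
    }
    # Step 2: positions of punctuation characters (-1 means "no match")
    punct <- gregexpr("[[:punct:]]", s)[[1]]
    # Step 3: punctuation that is not part of an emoticon
    drop <- setdiff(punct[punct != -1], protected)
    # Step 4: remove by position; skip when there is nothing to drop
    codes <- utf8ToInt(s)
    if (length(drop) > 0) codes <- codes[-drop]
    intToUtf8(codes)
  }
  vapply(x, clean_one, character(1), USE.NAMES = FALSE)
}

out <- remove_punct_by_position(c(text, "plain words", "wait... :-("), emots)
print(out)
```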


Tyl*_*ker 5

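Here is a less sophisticated approach that is likely slower than @gagolews's solution. It requires you to feed it an emoticon dictionary. You can create one yourself or use the one in the qdapDictionaries package. The basic approach converts the emoticons to text that cannot be mistaken for anything else (I use the prefix "EMOTICONREPLACE" to ensure this). It then strips the punctuation with qdap::strip and converts the placeholders back to emoticons with mgsub (dat is the question's data frame):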

library(qdap)
#reps <- emoticon
emos <- c(":-(", ":)", ":D", ":p", "X-(")
reps <- data.frame(seq_along(emos), emos)

reps[, 1] <- paste0("EMOTICONREPLACE", reps[, 1])
dat$Temp <- mgsub(as.character(reps[, 2]), reps[, 1], dat[, 1])
dat$Temp <- mgsub(reps[, 1], as.character(reps[, 2]), 
    strip(dat$Temp, digit.remove = FALSE, lower.case=FALSE))

View it:

truncdf(left_just(dat[, 3, drop=F]), 50)

##   Temp                                              
## 1 RT AirAsia ØØÙØÙÙÙÙ ÙØØØ ØØØÙ ÙØØØØÙ ØØØØÙÙÙí í No
## 2 You know there is a problem when customer service 
## 3 ãããæããããéãããæãããInappropriate announce:-(         
## 4 AirAsia your direct debit Maybank payment gateways
## 5 xdek ke flight AirAsia Malaysia to LA hahah:p bagi
## 6 AirAsia Apart from the slight delay and shortage o

EDIT: To keep ? and ! as requested, pass the char.keep argument to the strip function:

dat$Temp <- mgsub(reps[, 1], as.character(reps[, 2]), 
    strip(dat$Temp, digit.remove = FALSE, lower.case=FALSE, char.keep=c("!", "?")))
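If installing qdap is not an option, the placeholder idea can be approximated in base R. The following is a minimal sketch under stated assumptions, not qdap's implementation: the function name keep_emoticons_via_placeholders and the "END" suffix are invented here, and plain gsub("[[:punct:]]", ...) replaces strip, so it does none of strip's other clean-up. The suffix is there so that placeholder 1 is never a prefix of placeholder 10 once the dictionary has ten or more entries.

```r
# Sketch only: protect emoticons with placeholders, strip, then restore
keep_emoticons_via_placeholders <- function(x, emos) {
  # Alphanumeric-only tokens survive the punctuation removal below
  tokens <- paste0("EMOTICONREPLACE", seq_along(emos), "END")
  # fixed = TRUE treats each emoticon as a literal string, so no escaping is needed
  for (i in seq_along(emos)) x <- gsub(emos[i], tokens[i], x, fixed = TRUE)
  x <- gsub("[[:punct:]]", "", x)
  for (i in seq_along(emos)) x <- gsub(tokens[i], emos[i], x, fixed = TRUE)
  x
}

emos <- c(":-(", ":)", ":D", ":p", "X-(")
samples <- c("announce:-(", "Kudos :)", "a #great :D breakfast!",
             "hahah..:p bagi", "in the queue. X-(")
restored <- keep_emoticons_via_placeholders(samples, emos)
print(restored)
```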