Separate strings into rows unless between groups of delimiters

Question


I have utterances with annotation symbols:

    utt <- c(
      "↑hey girls↑ can I <join yo:u>",
      "((v: grunts))",
      "!damn shit! got it",
      "I mean /yeah we saw each other at a party:/↓ the other day"
    )

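I need to split `utt` into separate words, unless the words are enclosed by certain delimiters, including this class: `[(/≈↑£<>°!]`. Using a double negative lookahead, I do reasonably well for `utt`s in which only one such string occurs between delimiters, but I fail to split correctly when there are multiple such strings between delimiters: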

    library(tidyr)
    library(dplyr)
    data.frame(utt2 = utt) %>%
      separate_rows(utt2, sep = "(?!.*[(/≈↑£<>°!].*)\\s(?!.*[)/≈↑£<>°!])")
    # A tibble: 9 × 1
      utt2
      <chr>
    1 ↑hey girls↑ can I <join yo:u>
    2 ((v: grunts))
    3 !damn shit!
    4 got
    5 it
    6 I mean /yeah we saw each other at a party:/↓
    7 the
    8 other
    9 day

The expected result is:

     1 ↑hey girls↑
     2 can
     3 I
     4 <join yo:u>
     5 ((v: grunts))
     6 !damn shit!
     7 got
     8 it
     9 I
    10 mean
    11 /yeah we saw each other at a party:/↓
    12 the
    13 other
    14 day

Answer 1

Wik*_*żew 5

You can use

    data.frame(utt2 = utt) %>%
      separate_rows(utt2, sep = "(?:([/≈↓£°!↑]).*?\\1|\\([^()]*\\)|<[^<>]*>)(*SKIP)(*F)|\\s+")

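See the regex demo.

Note that in your case there are paired characters (like `(` and `)`, `<` and `>`) and non-paired ones (like `↑`, `£`). They require different handling, which is reflected in the pattern.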
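The difference between the two kinds of delimiters can be sketched in base R. This is only an illustration under stated assumptions: the input string `x` is hypothetical, and the delimiter set is reduced to the ASCII characters `/` and `!` so that the result does not depend on the session encoding. An unpaired delimiter closes with the same character, so a backreference can find its end, while each paired delimiter needs its own alternative.

```r
# Hypothetical input, not taken from the question's data.
x <- "!damn shit! got <it> (now)"

# Unpaired delimiters: capture the opener, then require the same char again (\\1).
# Paired delimiters: opener and closer differ, so each pair gets its own branch.
protected <- "([/!]).*?\\1|\\([^()]*\\)|<[^<>]*>"

# Extract every span that the pattern treats as one protected unit.
spans <- regmatches(x, gregexpr(protected, x, perl = TRUE))[[1]]
print(spans)
```

These are the spans that the full pattern later refuses to split inside.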

Details:

- `(?:([/≈↓£°!↑]).*?\\1|\\([^()]*\\)|<[^<>]*>)(*SKIP)(*F)` matches
  - `([/≈↓£°!↑]).*?\1|` - a `/`, `≈`, `↓`, `£`, `°`, `!` or `↑` char captured into Group 1, then any zero or more chars other than line break chars, as few as possible (see `.*?`), and then the same char that was captured into Group 1, or
  - `\([^()]*\)|` - a `(`, zero or more chars other than `(` and `)`, and then a `)` char, or
  - `<[^<>]*>` - a `<`, zero or more chars other than `<` and `>`, and then a `>` char
  - `(*SKIP)(*F)` - skip the matched text and restart a new search from the failure position
- `|` - or
- `\s+` - one or more whitespaces in any other context.
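The same skip-then-split idea can be tried without tidyr, using base R's `strsplit()` with `perl = TRUE`. This is a minimal sketch, not the answer's exact code: the vector `demo` is a hypothetical ASCII-only variant of the question's data, and the character class is reduced to `/` and `!` so the example does not depend on the session encoding.

```r
# Assumption: delimiter class reduced to the ASCII chars / and ! for portability.
pat <- "(?:([/!]).*?\\1|\\([^()]*\\)|<[^<>]*>)(*SKIP)(*F)|\\s+"

# Hypothetical ASCII-only variant of the question's utterances.
demo <- c(
  "can I <join yo:u>",
  "((v: grunts))",
  "!damn shit! got it",
  "I mean /yeah we saw each other at a party:/ the other day"
)

# Delimited spans are matched and discarded by (*SKIP)(*F),
# so only whitespace outside of them acts as a split point.
tokens <- unlist(strsplit(demo, pat, perl = TRUE))
print(tokens)
```

Each delimited group stays in one piece, and every other word becomes its own element.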
