Non-greedy gsub

leo*_*oce 5 regex r gsub

I have a log dataset:

V1  duration  id  startpoint
T<=>161[=]P<=>explorer.exe[=]I<=>1820[=]W<=>20094[=]V<=>6.00.2900.5512  7771    1   2012-05-07_12-29-51
T<=>195[=]P<=>360Safe.exe[=]I<=>1732[=]W<=>1017e[=]V<=>7, 5, 0, 1501[=]N<=>360????[=]C<=>360.cn 7771    1   2012-05-07_12-29-51
T<=>209[=]P<=>360Safe.exe[=]I<=>1732[=]W<=>1017e[=]V<=>7, 5, 0, 1501    7771    1   2012-05-07_12-29-51
T<=>211[=]P<=>360chrome.exe[=]I<=>436[=]U<=>www.hao123.com[=]A<=>2027a[=]B<=>20290[=]V<=>5.2.0.804  7771    1   2012-05-07_12-29-51 211

I am trying to extract information (time point, process, pid, url, and so on) from the first column. At first I tried:

df$timepoint <- gsub("T<=>(.*)[=].*", "\\1", df$V1)

It returned something like 161[=]P<=>explorer.exe[=]I<=>1820[=]W<=>20094[=]V<, so I then tried:

df$timepoint <- gsub("T<=>([0-9]*).*", "\\1", df$V1)

That works, but it will not work for text fields such as the process name, so I searched for "regex minimal match" and found the term non-greedy. I tried again:

df$timepoint <- gsub("T<=>(.*?)\\[=\\].*", "\\1", df$V1)
df$process <- gsub(".*P<=>(.*?)\\[=\\].*", "\\1", df$V1)
df$pid <- gsub(".*I<=>(.*?)\\[=\\].*", "\\1", df$V1)
df$url <- gsub(".*U<=>(.*?)\\[=\\].*", "\\1", df$V1)
df$addr <- gsub(".*A<=>(.*?)\\[=\\].*", "\\1", df$V1)
df$tab <- gsub(".*B<=>(.*?)\\[=\\].*", "\\1", df$V1)
df$ver <- gsub(".*V<=>(.*?)\\[=\\].*", "\\1", df$V1)
df$window <- gsub(".*W<=>(.*?)\\[=\\].*", "\\1", df$V1)
df$name <- gsub(".*N<=>(.*?)\\[=\\].*", "\\1", df$V1)
df$company <- gsub(".*C<=>(.*?)", "\\1", df$V1)

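Not every row contains every field, and that is where the problem occurs. If there is no software name or company name, R simply copies V1 into the new variable. If the software version is at the end of V1, the regex ".*V<=>(.*?)\\[=\\].*" also copies the whole string into the new variable: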

V1  duration  id  startpoint  timepoint process pid url addr  tab ver window  name  company
T<=>161[=]P<=>explorer.exe[=]I<=>1820[=]W<=>20094[=]V<=>6.00.2900.5512  7771    1   2012-05-07_12-29-51 161 explorer.exe    1820    T<=>161[=]P<=>explorer.exe[=]I<=>1820[=]W<=>20094[=]V<=>6.00.2900.5512  T<=>161[=]P<=>explorer.exe[=]I<=>1820[=]W<=>20094[=]V<=>6.00.2900.5512  T<=>161[=]P<=>explorer.exe[=]I<=>1820[=]W<=>20094[=]V<=>6.00.2900.5512  T<=>161[=]P<=>explorer.exe[=]I<=>1820[=]W<=>20094[=]V<=>6.00.2900.5512  20094   T<=>161[=]P<=>explorer.exe[=]I<=>1820[=]W<=>20094[=]V<=>6.00.2900.5512  T<=>161[=]P<=>explorer.exe[=]I<=>1820[=]W<=>20094[=]V<=>6.00.2900.5512
T<=>195[=]P<=>360Safe.exe[=]I<=>1732[=]W<=>1017e[=]V<=>7, 5, 0, 1501[=]N<=>360????[=]C<=>360.cn 7771    1   2012-05-07_12-29-51 195 360Safe.exe 1732    T<=>195[=]P<=>360Safe.exe[=]I<=>1732[=]W<=>1017e[=]V<=>7, 5, 0, 1501[=]N<=>360????[=]C<=>360.cn T<=>195[=]P<=>360Safe.exe[=]I<=>1732[=]W<=>1017e[=]V<=>7, 5, 0, 1501[=]N<=>360????[=]C<=>360.cn T<=>195[=]P<=>360Safe.exe[=]I<=>1732[=]W<=>1017e[=]V<=>7, 5, 0, 1501[=]N<=>360????[=]C<=>360.cn 7, 5, 0, 1501   1017e   360???? 360.cn
T<=>203[=]P<=>360chrome.exe[=]I<=>436[=]U<=>NULL[=]A<=>2027a[=]B<=>20290[=]V<=>5.2.0.804[=]N<=>360?????[=]C<=>360.cn    7771    1   2012-05-07_12-29-51 203 360chrome.exe   436 NULL    2027a   20290   5.2.0.804   T<=>203[=]P<=>360chrome.exe[=]I<=>436[=]U<=>NULL[=]A<=>2027a[=]B<=>20290[=]V<=>5.2.0.804[=]N<=>360?????[=]C<=>360.cn    360?????    360.cn
T<=>209[=]P<=>360Safe.exe[=]I<=>1732[=]W<=>1017e[=]V<=>7, 5, 0, 1501    7771    1   2012-05-07_12-29-51 209 360Safe.exe 1732    T<=>209[=]P<=>360Safe.exe[=]I<=>1732[=]W<=>1017e[=]V<=>7, 5, 0, 1501    T<=>209[=]P<=>360Safe.exe[=]I<=>1732[=]W<=>1017e[=]V<=>7, 5, 0, 1501    T<=>209[=]P<=>360Safe.exe[=]I<=>1732[=]W<=>1017e[=]V<=>7, 5, 0, 1501    T<=>209[=]P<=>360Safe.exe[=]I<=>1732[=]W<=>1017e[=]V<=>7, 5, 0, 1501    1017e   T<=>209[=]P<=>360Safe.exe[=]I<=>1732[=]W<=>1017e[=]V<=>7, 5, 0, 1501    T<=>209[=]P<=>360Safe.exe[=]I<=>1732[=]W<=>1017e[=]V<=>7, 5, 0, 1501
T<=>211[=]P<=>360chrome.exe[=]I<=>436[=]U<=>www.hao123.com[=]A<=>2027a[=]B<=>20290[=]V<=>5.2.0.804  7771    1   2012-05-07_12-29-51 211 360chrome.exe   436 www.hao123.com  2027a   20290   T<=>211[=]P<=>360chrome.exe[=]I<=>436[=]U<=>www.hao123.com[=]A<=>2027a[=]B<=>20290[=]V<=>5.2.0.804  T<=>211[=]P<=>360chrome.exe[=]I<=>436[=]U<=>www.hao123.com[=]A<=>2027a[=]B<=>20290[=]V<=>5.2.0.804  T<=>211[=]P<=>360chrome.exe[=]I<=>436[=]U<=>www.hao123.com[=]A<=>2027a[=]B<=>20290[=]V<=>5.2.0.804  T<=>211[=]P<=>360chrome.exe[=]I<=>436[=]U<=>www.hao123.com[=]A<=>2027a[=]B<=>20290[=]V<=>5.2.0.804

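I thought that if R could not find 'C<=>' (for example), there would be nothing for the following (.*?) to capture and the result would be an empty string, but the output contains the whole string instead. Can anyone help me fix this? Thanks!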
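The behavior described above follows from how gsub treats input that does not match: it has nothing to replace, so it returns the string unchanged rather than an empty string. A minimal sketch of that, plus one possible guard with grepl, is shown here. The vector x and the variable names are made up for illustration and are not from the original post.

```r
# Made-up two-row sample: only the second row has a C<=> field.
x <- c("T<=>161[=]P<=>explorer.exe",
       "T<=>195[=]P<=>360Safe.exe[=]C<=>360.cn")
pattern <- ".*C<=>(.*)"

# gsub() only rewrites what the pattern matches. Row 1 has no match,
# so it comes back exactly as it went in.
unguarded <- gsub(pattern, "\\1", x)
print(unguarded)

# One way to guard: test for a match first and use NA otherwise.
company <- ifelse(grepl(pattern, x), gsub(pattern, "\\1", x), NA_character_)
print(company)
```

With this guard, rows that lack the field end up as NA instead of a copy of V1.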

Update

Thanks to MrFlick's comment, I got a solution based on the following answer:

Taking the extraction of the software name as an example,

ind1 <- grep(".*N<=>(.*?)\\[=\\].*", df$V1, value = FALSE) # rows where N<=> is followed by another field
ind2 <- grep(".*N<=>(.*?)", df$V1, value = FALSE) # rows where N<=> appears at all
df$name <- ""
df$name[ind2] <- gsub(".*N<=>(.*?)", "\\1", df$V1[ind2]) # keep everything after N<=>
df$name[ind1] <- gsub(".*N<=>(.*?)\\[=\\].*", "\\1", df$V1[ind1]) # trim at the next [=] when there is one

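But this snippet seems ugly, and if it is the final solution I would have to repeat it for every other field (process, pid, version, company, and so on). Can anyone help optimize it? Thanks!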

MrF*_*ick 3

Here is another strategy. We can use gregexpr to split out each part of the stacked data. Here is the data as a vector:

V1 <- c("T<=>161[=]P<=>explorer.exe[=]I<=>1820[=]W<=>20094[=]V<=>6.00.2900.5512",
"T<=>195[=]P<=>360Safe.exe[=]I<=>1732[=]W<=>1017e[=]V<=>7, 5, 0, 1501[=]N<=>360安全卫士[=]C<=>360.cn",
"T<=>209[=]P<=>360Safe.exe[=]I<=>1732[=]W<=>1017e[=]V<=>7, 5, 0, 1501",
"T<=>211[=]P<=>360chrome.exe[=]I<=>436[=]U<=>www.hao123.com[=]A<=>2027a[=]B<=>20290[=]V<=>5.2.0.804")

Now we can split out the pieces with

m <- gregexpr("(\\w)<=>(.*?)(?:\\[=\\]|$)", V1, perl=TRUE)
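To see what this pattern picks up, the whole matches can be inspected with base R's regmatches before worrying about the capture groups. This is only a small sketch on a shortened, made-up row (row1), not on the full V1 vector.

```r
# Shortened sample row, for illustration only.
row1 <- "T<=>161[=]P<=>explorer.exe[=]I<=>1820"

# Same pattern as above: one key character, "<=>", then a lazy value that
# stops at the next "[=]" delimiter or at the end of the string.
m1 <- gregexpr("(\\w)<=>(.*?)(?:\\[=\\]|$)", row1, perl = TRUE)

# regmatches() returns the full text of each match, one list element per input.
pieces <- regmatches(row1, m1)[[1]]
print(pieces)
```

Each element of pieces is one key/value chunk, with the trailing [=] still attached on every chunk but the last, which is why the capture groups are needed to get clean keys and values.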

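Extracting the captured matches can be messy, but I use a helper function, regcapturedmatches, to grab all of the captured data easily. It is not part of base R, so it has to be defined or sourced first; I use it just as you would use the built-in regmatches.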

data <- regcapturedmatches(V1, m)
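If regcapturedmatches is not at hand, roughly the same result can be assembled from the capture.start and capture.length attributes that gregexpr attaches when perl = TRUE. The sketch below is an assumption about one way to do that, not the answer's own helper; capture_groups, logs, and pairs are hypothetical names.

```r
# Hypothetical stand-in (not the answer's function): read the capture groups
# of a gregexpr(..., perl = TRUE) result from its attributes.
capture_groups <- function(x, m) {
  lapply(seq_along(x), function(i) {
    if (m[[i]][1] == -1) {
      return(NULL)  # no match in this element
    }
    st <- attr(m[[i]], "capture.start")   # matrix: one row per match
    ln <- attr(m[[i]], "capture.length")  # same shape as st
    # substring() drops the matrix shape, so restore it afterwards.
    matrix(substring(x[i], st, st + ln - 1L), nrow = nrow(st))
  })
}

logs <- c("T<=>161[=]P<=>explorer.exe", "T<=>209[=]V<=>7, 5, 0, 1501")
m <- gregexpr("(\\w)<=>(.*?)(?:\\[=\\]|$)", logs, perl = TRUE)
pairs <- capture_groups(logs, m)
print(pairs[[2]])
```

Each list element is a two-column character matrix, with keys in the first column and values in the second.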

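Then, if you inspect data, you can see that all the information is there. The only remaining problem is that we need to arrange it into columns rather than rows as it is now. For that I use reshape2: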

library(reshape2)

#combine list into one data.frame
sdata <- do.call(rbind, lapply(seq_along(data),
    function(i) data.frame(data[[i]], S=i)))

#turn rows into columns
dcast(sdata, S~X1, value.var="X2")

which returns

  S    I             P   T              V     W      C           N     A     B
1 1 1820  explorer.exe 161 6.00.2900.5512 20094   <NA>        <NA>  <NA>  <NA>
2 2 1732   360Safe.exe 195  7, 5, 0, 1501 1017e 360.cn 360安全卫士  <NA>  <NA>
3 3 1732   360Safe.exe 209  7, 5, 0, 1501 1017e   <NA>        <NA>  <NA>  <NA>
4 4  436 360chrome.exe 211      5.2.0.804  <NA>   <NA>        <NA> 2027a 20290
               U
1           <NA>
2           <NA>
3           <NA>
4 www.hao123.com
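For readers who would rather not depend on reshape2, the same long-to-wide step can be sketched in base R by looking each key up in a named vector. The input below is a hand-written list of key/value matrices in the shape described above; kv_list, keys, and wide are made-up names, and this is an alternative technique rather than what the answer does.

```r
# Hand-written stand-in for the captured key/value pairs, one matrix per row.
kv_list <- list(
  matrix(c("T", "P", "161", "explorer.exe"), ncol = 2),
  matrix(c("T", "P", "U", "211", "360chrome.exe", "www.hao123.com"), ncol = 2)
)

# Every key that occurs anywhere becomes a column.
keys <- unique(unlist(lapply(kv_list, function(p) p[, 1])))

# For each row, index a named vector of values by the full key set;
# keys that are absent from that row come back as NA.
rows <- lapply(kv_list, function(p) unname(setNames(p[, 2], p[, 1])[keys]))
wide <- as.data.frame(do.call(rbind, rows), stringsAsFactors = FALSE)
names(wide) <- keys
print(wide)
```

Missing fields show up as NA, as they do in the dcast output.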

You can rename the columns and so on, but it really is not much code to do all of the transformations at once.
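When only one or two fields are needed, a lighter option is a small per-field helper that returns NA if the key is absent and also handles a value that sits at the end of the string. The sketch below is an assumption about how that could look, using a different technique from the gregexpr approach; extract_field and sample_rows are hypothetical names.

```r
# Hypothetical helper: extract the value for a single key, or NA if missing.
extract_field <- function(x, key) {
  # The key must sit at the start of the string or right after a "[=]".
  # The value is lazy and ends at the next "[=]" or at the end of the string.
  pattern <- paste0("^.*?(?:^|\\[=\\])", key, "<=>(.*?)(?:\\[=\\].*)?$")
  ifelse(grepl(pattern, x, perl = TRUE),
         sub(pattern, "\\1", x, perl = TRUE),
         NA_character_)
}

sample_rows <- c(
  "T<=>161[=]P<=>explorer.exe[=]I<=>1820[=]W<=>20094[=]V<=>6.00.2900.5512",
  "T<=>195[=]P<=>360Safe.exe[=]V<=>7, 5, 0, 1501[=]C<=>360.cn"
)

print(extract_field(sample_rows, "V"))  # version, mid-string or at the end
print(extract_field(sample_rows, "C"))  # company, missing in the first row
```

A call such as df$ver <- extract_field(df$V1, "V") would then replace each of the separate gsub lines, under the assumption that keys are single tokens followed by <=>.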
